Commit graph

77 commits

Wang, Eikan
e5a5cd149f Simplify IfThenElse and CompareSelect within for-loop (#76793)
Analyze the loop-index range to determine whether a condition can ever be satisfied. If a for-loop body contains an `IfThenElse` or `CompareSelect` whose condition depends on the for-loop index `Var`, we analyze the index range to check whether the condition is always (or never) satisfied. If the condition is deterministic, the logic is simplified accordingly.
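The analysis described above can be sketched as a simple interval check. The helper names below are illustrative, not the actual NNC API:

```python
# Illustrative sketch of range-based condition folding (not the NNC code).
# A loop index i ranges over the half-open interval [lo, hi).

def classify_less_than(lo, hi, c):
    """Classify the condition `i < c` over all i in [lo, hi)."""
    if hi <= c:
        return "always"   # even the largest index, hi - 1, is below c
    if lo >= c:
        return "never"    # even the smallest index is not below c
    return "unknown"

def simplify_if_then_else(lo, hi, c, then_val, else_val):
    """Fold IfThenElse(i < c, then, else) when the range decides it."""
    verdict = classify_less_than(lo, hi, c)
    if verdict == "always":
        return then_val
    if verdict == "never":
        return else_val
    return None  # range does not decide the condition; leave it in place
```

When the verdict is `"unknown"`, the statement is left untouched; only a deterministic range lets the branch be replaced outright.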
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76793
Approved by: https://github.com/huiguoo
2022-05-15 20:21:28 +00:00
Mikhail Zolotukhin
1855b14922 [TensorExpr] Delete DimArg class. (#72390)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72390

This class didn't add much value and only caused more boilerplate code.
This change removes the class and replaces all its uses with `ExprHandle`.

A side effect of this change is different names in loop variables, which
caused massive mechanical changes in our tests.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D34030296

Pulled By: ZolotukhinM

fbshipit-source-id: 2ba4e313506a43ab129a10d99e72b638b7d40108
(cherry picked from commit c2ec46a0587cafd4e915c5bf1e0dc0b5d244e8d5)
2022-02-11 01:21:59 +00:00
Richard Barnes
e0643fa3fc use irange for loops 5 (#66744)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66744

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for (TYPE var = x0; var < x_max; var++)`

to the format

`for (const auto var : irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit; a number of reversions and unused-variable warning suppressions were added by hand.
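The actual loop-upgrader script (D28874212) is internal; as a rough illustration, the core rewrite it performs could look like a regex substitution of this shape (the pattern and function name here are ours, and the sketch only handles the simplest zero-based `i++` form):

```python
import re

# Hypothetical sketch of a loop-upgrader rewrite: turn C-style counting loops
# that start at 0 into range-based loops over c10::irange.
LOOP_RE = re.compile(
    r"for\s*\(\s*(?P<type>[\w:]+)\s+(?P<var>\w+)\s*=\s*0\s*;"
    r"\s*(?P=var)\s*<\s*(?P<bound>\w+)\s*;\s*(?P=var)\+\+\s*\)"
)

def upgrade_loops(source: str) -> str:
    """Rewrite `for (T i = 0; i < n; i++)` into `for (const auto i : c10::irange(n))`."""
    return LOOP_RE.sub(r"for (const auto \g<var> : c10::irange(\g<bound>))", source)
```

Loops that don't start at 0, or whose bound is a complex expression, fall outside the pattern and are left alone, which is why hand fix-ups were still needed.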

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31705358

fbshipit-source-id: d6ea350cbaa8f452fc78f238160e5374be637a48
2021-10-18 21:59:50 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for (TYPE var = x0; var < x_max; var++)`

to the format

`for (const auto var : irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212), with some modifications to exclude all files under /torch/jit; a number of reversions and unused-variable warning suppressions were added by hand.

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Mikhail Zolotukhin
f23f21dafe [TensorExpr] Remove 'Placeholder' class. (#64887)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64887

BufHandle has exactly the same functionality and should be used instead.

Differential Revision: D30889483

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 365fe8e396731b88920535a3de96bd3301aaa3f3
2021-09-14 00:22:44 -07:00
Hui Guo
4481c87ac4 [tensorexpr] Simplify x/100 -> 0 if x is a non-negative integer less than 100. (#64763)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64763

Simplification pattern:
  x/N -> 0; N is a constant positive integer and x is a for-loop index whose range is a subset of [0, N).
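The rule can be stated as an interval-containment check; the form below is illustrative, not the NNC implementation:

```python
# Illustrative form of the rule: x / N folds to 0 when N is a positive
# constant and the index x ranges over [lo, hi), a subset of [0, N).
def simplify_div(lo, hi, n):
    """Return 0 if x / n is provably 0 for every index x in [lo, hi)."""
    if n > 0 and lo >= 0 and hi <= n:
        return 0
    return None  # range not contained in [0, n): cannot fold

# Sanity check of the underlying arithmetic fact:
assert all(x // 100 == 0 for x in range(100))
```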

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D30845854

Pulled By: huiguoo

fbshipit-source-id: 814d69ed4be05e57405c222183cc1c6c526721cd
2021-09-10 20:33:02 -07:00
Mikhail Zolotukhin
f0d274294d [TensorExpr] Nuke KernelArena and KernelScope. (#63587)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63587

Now that no classes use KernelArena for memory management, we can
remove it.

Differential Revision: D30429115

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 375f6f9294d27790645eeb7cb5a8e87047a57544
2021-08-24 00:32:16 -07:00
Mikhail Zolotukhin
62d02f2b57 [TensorExpr] Make 'Tensor' a value type. (#63586)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63586

This is another commit in the transition away from KernelArena memory
management. Tensor is essentially just a pair of <BufPtr, StmtPtr>, so we
don't need to dynamically allocate it at all: it's cheap to pass by value,
and that's what we're switching to in this commit.

After this change nothing uses KernelScope/KernelArena and they can be
safely removed.

Differential Revision: D30429114

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: f90b859cfe863692b7beffbe9bd0e4143df1e819
2021-08-24 00:32:13 -07:00
Mikhail Zolotukhin
7fdba4564a [TensorExpr] IRSimplifier: sort terms in polynomials, terms, minterms, maxterms. (#63197)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63197

This solves non-determinism from using hash values in sort methods.
Changes in tests are mostly mechanical.
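The nondeterminism comes from ordering terms by hash value, which can vary between runs; sorting by a canonical structural key is stable. A toy illustration (term representation and key are ours, not the NNC code):

```python
# Illustrative: sort polynomial terms by a canonical key (variable names,
# then coefficient) instead of by hash, so the output order is stable
# across runs.
def canonical_order(terms):
    """Sort terms of a polynomial deterministically."""
    return sorted(terms, key=lambda t: (sorted(t["vars"]), t["coeff"]))

terms = [
    {"coeff": 2, "vars": ["y"]},
    {"coeff": 3, "vars": ["x", "y"]},
    {"coeff": 1, "vars": ["x"]},
]
ordered = canonical_order(terms)
```

The same input always yields the same order, regardless of how the terms were produced, which is what lets the test expectations be written down once.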

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292776

Pulled By: ZolotukhinM

fbshipit-source-id: 74f57b53c3afc9d4be45715fd74781271373e055
2021-08-18 14:49:27 -07:00
Mikhail Zolotukhin
1dc2b52764 [TensorExpr] Add a wrapper for all expr and stmt pointers. (#63195)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/63195

This helps us to later switch from using KernelArena with raw pointers
to shared pointers without having to change all our source files at
once.

The changes are mechanical and should not affect any functionality.

With this PR, we're changing the following:
 * `Add*` --> `AddPtr`
 * `new Add(...)` --> `alloc<Add>(...)`
 * `dynamic_cast<Add*>` --> `to<Add>`
 * `static_cast<Add*>` --> `static_to<Add>`

Due to some complications with args forwarding, some places became more
verbose, e.g.:
 * `new Block({})` --> `new Block(std::vector<ExprPtr>())`

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D30292779

Pulled By: ZolotukhinM

fbshipit-source-id: 150301c7d2df56b608b035827b6a9a87f5e2d9e9
2021-08-17 13:44:45 -07:00
Raghavan Raman
59dd12042e [nnc] Removed const from all fields in IR. (#62336)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62336

This PR was generated by removing `const` for all types of nodes in NNC IR, and fixing compilation errors that were the result of this change.

This is the first step in making all NNC mutations in-place.

Test Plan: Imported from OSS

Reviewed By: iramazanli

Differential Revision: D30049829

Pulled By: navahgar

fbshipit-source-id: ed14e2d2ca0559ffc0b92ac371f405579c85dd63
2021-08-03 11:44:36 -07:00
Hui Guo
3a592730d5 [nnc] Simplify i%100 to i if i is less than 100; fixed #52580 (#60693)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60693
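The `i % 100 -> i` rule is the mod counterpart of the earlier division fold: when the index is known to lie in [0, N), the remainder is the index itself. An illustrative form (not the NNC code):

```python
# Illustrative form of the rule: x % N folds to x itself when x is a
# non-negative index known to be less than N.
def simplify_mod(lo, hi, n):
    """Return "x" (meaning: the index unchanged) if x % n == x on [lo, hi)."""
    if n > 0 and lo >= 0 and hi <= n:
        return "x"
    return None  # range not contained in [0, n): cannot fold

# Sanity check of the underlying arithmetic fact:
assert all(i % 100 == i for i in range(100))
```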

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29375938

Pulled By: huiguoo

fbshipit-source-id: 1388729c5b93805cb156efa53e8823d5462885bf
2021-08-02 18:38:54 -07:00
Hui Guo
8f7ae77040 [nnc] Add context-sensitive simplification for div/mod (#60688)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60688

Test Plan: Imported from OSS

Reviewed By: navahgar, ZolotukhinM

Differential Revision: D29373313

Pulled By: huiguoo

fbshipit-source-id: 90d7f2fbfce583b0ea3b0f1c7899e22b0210bd62
2021-08-02 18:37:39 -07:00
Nikita Shulga
a9b0a921d5 Disable avoid-non-const-global-variables lint check (#62008)
Summary:
The GoogleTest `TEST` macro is non-compliant with this check, as is `DEFINE_DISPATCH`.

All changes but the ones to `.clang-tidy` are generated using following script:
```
for i in `find . -type f -iname "*.c*" -or -iname "*.h"|xargs grep cppcoreguidelines-avoid-non-const-global-variables|cut -f1 -d:|sort|uniq`;  do sed -i "/\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-non-const-global-variables)/d" $i; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/62008

Reviewed By: driazati, r-barnes

Differential Revision: D29838584

Pulled By: malfet

fbshipit-source-id: 1b2f8602c945bd4ce50a9bfdd204755556e31d13
2021-07-22 18:04:40 -07:00
Raghavan Raman
843c42ffd8 [nnc] Refactored test macros and updated compress buffer tests to use them (#61716)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/61716

Test Plan: Imported from OSS

Reviewed By: ZolotukhinM

Differential Revision: D29715754

Pulled By: navahgar

fbshipit-source-id: c400a58b7f393c0f93e5a25f118403124f8834b0
2021-07-15 21:17:14 -07:00
Raghavan Raman
34d6618386 [NNC] Fixing a bug in simplifier (#58291)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/58291

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D28435393

Pulled By: navahgar

fbshipit-source-id: 517e47385a93a43d2ddf054382adc81c18484066
2021-05-18 01:28:33 -07:00
CodemodService FBSourceClangFormatLinterBot
cbfce376a8 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D28319469

fbshipit-source-id: 8295597a8ee16b2fef3f7aacdd6c892cb22db988
2021-05-10 03:39:31 -07:00
Nikita Shulga
3a66a1cb99 [clang-tidy] Exclude cppcoreguidelines-avoid-magic-numbers (#57841)
Summary:
Add cppcoreguidelines-avoid-magic-numbers exclusion to clang-tidy
Remove existing nolint warnings using following script:
```
for file in `git ls-files | grep -v \.py`; do gsed '/^ *\/\/ NOLINTNEXTLINE(cppcoreguidelines-avoid-magic-numbers)/d' -i  $file; done
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57841

Reviewed By: samestep

Differential Revision: D28295045

Pulled By: malfet

fbshipit-source-id: 7c6e8d1213c9593f169ed3df6a916498f1a97163
2021-05-07 20:02:33 -07:00
Nikita Shulga
4cb534f92e Make PyTorch code-base clang-tidy compliant (#56892)
Summary:
This is an automatic change generated by the following script:
```
#!/usr/bin/env python3
from subprocess import check_output, check_call
import os

def get_compiled_files_list():
    import json
    with open("build/compile_commands.json") as f:
        data = json.load(f)
    files = [os.path.relpath(node['file']) for node in data]
    for idx, fname in enumerate(files):
        if fname.startswith('build/') and fname.endswith('.DEFAULT.cpp'):
            files[idx] = fname[len('build/'):-len('.DEFAULT.cpp')]
    return files

def run_clang_tidy(fname):
    check_call(["python3", "tools/clang_tidy.py", "-c", "build", "-x", fname,"-s"])
    changes = check_output(["git", "ls-files", "-m"])
    if len(changes) == 0:
        return
    check_call(["git", "commit","--all", "-m", f"NOLINT stubs for {fname}"])

def main():
    git_files = check_output(["git", "ls-files"]).decode("ascii").split("\n")
    compiled_files = get_compiled_files_list()
    for idx, fname in enumerate(git_files):
        if fname not in compiled_files:
            continue
        if fname.startswith("caffe2/contrib/aten/"):
            continue
        print(f"[{idx}/{len(git_files)}] Processing {fname}")
        run_clang_tidy(fname)

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/56892

Reviewed By: H-Huang

Differential Revision: D27991944

Pulled By: malfet

fbshipit-source-id: 5415e1eb2c1b34319a4f03024bfaa087007d7179
2021-04-28 14:10:25 -07:00
Mikhail Zolotukhin
1263448cb2 [TensorExpr] Remove mask field from Load and Store classes. (#55825)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55825

The mask has never been used (in vectorization we generate an explicit
`IfThenElse` construct when we need to mask out some elements). The PR
removes it and cleans up all its traces from tests.

Differential Revision: D27717776

Test Plan: Imported from OSS

Reviewed By: navahgar

Pulled By: ZolotukhinM

fbshipit-source-id: 41d1feeea4322da75b3999d661801c2a7f82b9db
2021-04-13 12:08:51 -07:00
Bert Maher
7367bca066 [nnc] Tests for proposed feature: loop bounds conditional simplification (#54121)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/54121

It would be nice to do range analysis to determine if a condition
cannot be satisfied.  These are some tests that we should be able to turn on
once we have this feature.
ghstack-source-id: 124116847

Test Plan: Simplify.*LoopBounds

Reviewed By: ZolotukhinM

Differential Revision: D27107956

fbshipit-source-id: bb27e3d3bc803f0101c416e4a351ba2278684980
2021-03-17 11:01:10 -07:00
Hui Guo
8737c2a1a2 [TensorExpr] Reland: "Simplify index expressions constructed in loop flattening. Fixes #51173" (#53861)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/53861

Replaced the iterators in the for-loops with integer index variables due to
overflow when handling empty vectors.

Test Plan: Imported from OSS

Reviewed By: navahgar

Differential Revision: D26998894

Pulled By: huiguoo

fbshipit-source-id: a1f6475c8ba123968ef7247b4f6f38edbf24b9ef
2021-03-11 23:52:36 -08:00
Edward Yang
07d315fce8 Revert D26676150: Simplify index expressions constructed in loop flattening - #51173
Test Plan: revert-hammer

Differential Revision:
D26676150 (1f01899e4a)

Original commit changeset: e202e0c8610e

fbshipit-source-id: 9611dda6897b67e16e44c731994bc9e5fccab0b9
2021-03-11 07:17:38 -08:00
Hui Guo
1f01899e4a Simplify index expressions constructed in loop flattening - #51173 (#52882)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/52882

Test Plan:
Imported from OSS

build/bin/test_tensorexpr

Reviewed By: ZolotukhinM

Differential Revision: D26676150

Pulled By: huiguoo

fbshipit-source-id: e202e0c8610eb107558a3add8a6560a0cb97704a
2021-03-10 18:37:42 -08:00
Mikhail Zolotukhin
d3b427a0e3 [TensorExpr] Add an unmasked Load constructor. (#52790)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52790

Fixes #52774.

Test Plan: Imported from OSS

Reviewed By: bertmaher

Differential Revision: D26649542

Pulled By: ZolotukhinM

fbshipit-source-id: ab1c9e55f52e59d0bd00fbde2ec3125f8c7917ee
2021-02-24 22:45:29 -08:00
Hui Guo
973e306c84 changed TE 'Allocate' API to take one argument 'Buf' instead of three arguments 'Var', 'dtype', 'dims'. (#50167)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/50167

Test Plan:
Imported from OSS

`python test/test_jit_fuser_te.py`
`python test/test_jit_fuser_legacy.py`
`python test/test_jit_fuser.py`
`build/bin/test_tensorexpr`

Reviewed By: ZolotukhinM

Differential Revision: D25814342

Pulled By: huiguoo

fbshipit-source-id: 44cba7f92365b826c9cb1d385a94858934570dee
2021-02-22 15:08:51 -08:00
Andres Suarez
8530c65e25 [codemod][fbcode/caffe2] Apply clang-format update fixes
Test Plan: Sandcastle and visual inspection.

Reviewed By: igorsugak

Differential Revision: D25849205

fbshipit-source-id: ef664c1ad4b3ee92d5c020a5511b4ef9837a09a0
2021-01-09 14:37:36 -08:00
Mikhail Zolotukhin
a5b27d7a31 [TensorExpr] Move SimpleIREval implementation from .h to .cpp. (#49697)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49697

Mostly mechanical move. This refactoring helps to hide unnecessary
details from the SimpleIREval interface and make it more similar to a
pure 'codegen'.

Test Plan: Imported from OSS

Reviewed By: nickgg

Differential Revision: D25668696

Pulled By: ZolotukhinM

fbshipit-source-id: 423247bfcdfa88403e8ec92152f00110bb9da19c
2020-12-21 20:20:15 -08:00
Raghavan Raman
46c9a0e679 Do not use negative values in GCD computation. (#49379)
Summary:
GCD should always return positive integers. When negative values are used, we hit a corner case that results in an infinite recursion during simplification.
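The fix amounts to normalizing signs before the Euclidean recursion. With C-style truncating `%`, a naive `gcd(b, a % b)` on negative inputs can keep producing negative remainders and never reach the base case; taking absolute values up front avoids that. An iterative sketch (ours, not the NNC code):

```python
# Illustrative: normalize signs before running the Euclidean algorithm, so
# every step works on non-negative values and the loop always terminates.
def safe_gcd(a, b):
    a, b = abs(a), abs(b)   # GCD is defined on magnitudes
    while b:
        a, b = b, a % b
    return a
```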

Pull Request resolved: https://github.com/pytorch/pytorch/pull/49379

Reviewed By: ezyang

Differential Revision: D25597115

Pulled By: navahgar

fbshipit-source-id: b0e8ac07ee50a5eb775c032628d4840df7424927
2020-12-21 15:08:43 -08:00
Peng Wu
6568572712 Support integral types for kAbs in SimpleIREvaluator (#49357)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49357

This is a follow-up to PR #48679, which added support for integer inputs to
aten::abs by promoting integers to float and then demoting the result back
to integers. This PR supports integer inputs to aten::abs more efficiently
in the SimpleIREvaluator by implementing integer inputs for kAbs (renamed
from kFabs).
- Rename kFabs to kAbs
- Add support for integer input to kAbs in the SimpleIREvaluator (note that
llvm_codegen and cuda_codegen already support integer inputs to kAbs)
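Besides efficiency, the float round-trip is lossy for wide integers: a 64-bit value above 2^53 is not exactly representable as a double. A quick illustration of the two paths (sketched in Python, not the SimpleIREvaluator code):

```python
# Illustrative comparison of the old and new abs paths for integers.
def abs_via_float(v: int) -> int:
    """The old path, sketched: promote to float, take abs, demote back."""
    return int(abs(float(v)))

def abs_native(v: int) -> int:
    """The new path, sketched: evaluate abs directly on the integer."""
    return abs(v)

big = 2**53 + 1  # not exactly representable as a double
```

For `big`, the float path silently rounds to 2^53 while the native path preserves the value.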

Test Plan:
- `PYTORCH_TENSOREXPR_DONT_USE_LLVM=1 python test/test_jit_fuser_te.py
TestTEFuser.test_unary_ops`
- `python test/test_jit_fuser_te.py TestTEFuser.test_unary_ops`

Imported from OSS

Reviewed By: eellison

Differential Revision: D25545791

fbshipit-source-id: e52f51a352d149f66ce8341fb3beb479be08a230
2020-12-18 07:57:58 -08:00
Bert Maher
07657b6001 [tensorexpr] Switch cpp tests to pure gtest (#48160)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48160

We no longer use the custom C++ test infra anyway, so move to pure
gtest.

Fixes #45703
ghstack-source-id: 116977283

Test Plan: `buck test //caffe2/test/cpp/tensorexpr`

Reviewed By: navahgar, nickgg

Differential Revision: D25046618

fbshipit-source-id: da34183d87465f410379048148c28e1623618553
2020-11-18 12:23:34 -08:00
Nick Gibson
17f8c329df [NNC] IRSimplifier rules for Compare and Mod (#46412)
Summary:
Adds new rules to the NNC IRSimplifier to take care of the following cases:

* Comparisons which are symbolic but have a constant difference. E.g. this is most useful in cases like `if (x > x + 4) ...` which we can now eliminate.

* Simplification of `Mod` nodes, including simple rules such as `0 % x` and `x % 1`, but also factorization of both sides to find common symbolic multiples. E.g. `(x * y) % x` can be cancelled out to `0`.

See tests for many more examples!
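The two rule families can be sketched as follows (illustrative forms only, not the NNC code; the factor-list representation of the mod left-hand side is ours):

```python
# Illustrative sketches of the two new rule families.

def fold_compare_const_diff(diff, op):
    """Fold `x <op> x + diff`: the result depends only on diff."""
    return {"<": 0 < diff, ">": 0 > diff, "==": diff == 0}[op]

def fold_mod(lhs_factors, rhs):
    """Fold simple Mod patterns: 0 % x, x % 1, and (x * y * ...) % x."""
    if 0 in lhs_factors or rhs == 1 or rhs in lhs_factors:
        return 0
    return None  # no rule applies
```

So `x > x + 4` folds to false without knowing anything about `x`, and `(x * y) % x` cancels to 0 because `x` is a factor of the left-hand side.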

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46412

Reviewed By: navahgar

Differential Revision: D24396151

Pulled By: nickgg

fbshipit-source-id: abb954dc930867d62010dcbcd8a4701430733715
2020-10-19 19:37:09 -07:00
Nick Gibson
2fa91fa305 [NNC] Fix crash when simplifying certain subtractions (#46108)
Summary:
Fixes a crash bug in the IRSimplifier when the LHS is a Term (e.g. 2x) and the RHS is a Polynomial (e.g. 2x+1).

This case crashes 100% of the time so I guess it's not very common in models we've been benchmarking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46108

Reviewed By: agolynski

Differential Revision: D24226593

Pulled By: nickgg

fbshipit-source-id: ef454c855ff472febaeba16ec34891df932723c0
2020-10-09 15:15:55 -07:00
Mikhail Zolotukhin
4aca63d38a [TensorExpr] Change API for creating Load and Store expressions. (#45520)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45520

With this change `Load`s and `Store`s no longer accept `Placeholder`s in
their constructor and `::make` functions and can only be built with
`Buf`.
`Placeholder` gets its own `store`, `load`, `storeWithMask`, and
`loadWithMask` method for more convenient construction.

Test Plan: Imported from OSS

Reviewed By: glaringlee

Differential Revision: D23998789

Pulled By: ZolotukhinM

fbshipit-source-id: 3fe018e00c1529a563553b2b215f403b34aea912
2020-09-29 20:52:38 -07:00
Mikhail Zolotukhin
3c33695a6d [TensorExpr] Rename Buffer to Placeholder. (#45389)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/45389

Differential Revision: D23952866

Test Plan: Imported from OSS

Reviewed By: nickgg

Pulled By: ZolotukhinM

fbshipit-source-id: 17eedd3ac17897501403482ac1866c569d247c75
2020-09-29 01:21:54 -07:00
Alex Suhan
76c185dcca [TensorExpr] When lanes differ, insert Broadcast instead of Cast (#45179)
Summary:
We need to check if dtypes differ in scalar type or lanes to decide between
Cast and Broadcast.
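The decision the commit describes can be sketched as a small dispatch on the two dtype components (function and names are ours, not the NNC API):

```python
# Illustrative: choose the node to insert when converting between dtypes.
# A dtype here is a (scalar_type, lanes) pair.
def convert_node(src_scalar, src_lanes, dst_scalar, dst_lanes):
    if src_lanes != dst_lanes:
        return "Broadcast"  # lane count differs: replicate across lanes
    if src_scalar != dst_scalar:
        return "Cast"       # same lanes, different scalar type
    return "NoOp"           # dtypes already match
```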

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45179

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyBroadcastTermExpander

Reviewed By: bwasti

Differential Revision: D23873316

Pulled By: asuhan

fbshipit-source-id: ca141be67e10c2b6c5f2ff9c11e42dcfc62ac620
2020-09-23 17:06:54 -07:00
Alex Suhan
215679573e [TensorExpr] Fix operator order in combineMultilane (#45157)
Summary:
combineMultilane used the wrong operand order when the Ramp was on the
left-hand side, which matters for subtraction.
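Why the order matters: subtraction is not commutative, so a Ramp minus a Broadcast is a different vector than a Broadcast minus a Ramp. Modeling the multilane values as plain lists (illustrative, not the NNC IR):

```python
# Illustrative multilane values: a Ramp is base + i*stride per lane, a
# Broadcast repeats one scalar across all lanes.
def ramp(base, stride, lanes):
    return [base + i * stride for i in range(lanes)]

def broadcast(v, lanes):
    return [v] * lanes

def sub(lhs, rhs):
    """Lane-wise subtraction; swapping lhs/rhs changes the result."""
    return [a - b for a, b in zip(lhs, rhs)]
```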

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45157

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.SimplifyRampSubBroadcast

Reviewed By: ailzhang

Differential Revision: D23851751

Pulled By: asuhan

fbshipit-source-id: 864d1611e88769fb43327ef226bb3310017bf858
2020-09-22 23:50:47 -07:00
Nick Gibson
4bbb6adff5 [NNC] fix SyncThreads insertion and reenable CudaSharedMem test (#44909)
Summary:
A previous fix for masking Cuda dimensions (https://github.com/pytorch/pytorch/issues/44733) changed the behaviour of inserting thread synchronization barriers in the Cuda CodeGen, causing the CudaSharedMemReduce_1 to be flaky and ultimately disabled.

The issue is working out where these barriers must be inserted - solving this optimally is very hard, and I think not possible without dependency analysis we don't have, so I've changed our logic to be quite pessimistic. We'll insert barriers before and after any blocks that have thread dimensions masked (even between blocks that have no data dependencies). This should be correct, but it's an area we could improve performance. To address this somewhat I've added a simplifier pass that removes obviously unnecessary syncThreads.

To avoid this test being flaky again, I've added a check against the generated code to ensure there is a syncThread in the right place.

Also fixed a couple of non-functional clarity issues in the generated code: added the missing newline after Stores in the CudaPrinter, and prevented the PrioritizeLoad mutator from pulling out loads contained within simple Let statements (such as those produced by the Registerizer).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44909

Reviewed By: agolynski

Differential Revision: D23800565

Pulled By: nickgg

fbshipit-source-id: bddef1f40d8d461da965685f01d00b468d8a2c2f
2020-09-21 09:27:22 -07:00
Nick Gibson
f175830558 [NNC] Fuse identical conditions in simplifier (#44886)
Summary:
Adds a pass to the IR Simplifier which fuses together the bodies of Cond statements which have identical conditions. e.g.

```
if (i < 10) {
  do_thing_1;
} else {
  do_thing_2;
}
if (i < 10) {
  do_thing_3;
}
```

is transformed into:

```
if (i < 10) {
  do_thing_1;
  do_thing_3;
} else {
  do_thing_2;
}
```
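A minimal sketch of the fusion pass, with statement tuples standing in for the NNC IR (the representation is ours, not the actual `Cond` class):

```python
# Illustrative: merge adjacent Cond statements with equal condition
# expressions. A Cond is ("cond", condition, then_stmts, else_stmts).
def fuse_identical_conds(stmts):
    out = []
    for s in stmts:
        if (out and s[0] == "cond" and out[-1][0] == "cond"
                and out[-1][1] == s[1]):
            prev = out[-1]  # same condition: append bodies to the previous Cond
            out[-1] = ("cond", prev[1], prev[2] + s[2], prev[3] + s[3])
        else:
            out.append(s)
    return out
```

Only directly adjacent Conds are fused; any intervening statement breaks the chain, since it could change what the condition observes.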

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44886

Reviewed By: glaringlee

Differential Revision: D23768565

Pulled By: nickgg

fbshipit-source-id: 3fe40d91e82bdfff8dcb8c56a02a4fd579c070df
2020-09-18 11:38:03 -07:00
Nick Gibson
204f985fc3 [NNC] Add simplification of Loop + Condition patterns. (#44764)
Summary:
Adds a new optimization to the IRSimplifier which changes this pattern:
```
for ...
  if ...
   do thing;
```
into:
```
if ...
  for ...
    do thing;
```

This should be almost strictly better.

There are many cases where this isn't safe to do (hence the tests), most obviously when the condition depends on something modified within the loop.
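The transform can be sketched as follows, again with tuples standing in for the IR. The invariance check here is a crude substring test on the loop variable; the real pass needs proper dependence analysis, including on anything written inside the loop:

```python
# Illustrative: hoist a loop-invariant condition out of a loop.
# A loop is ("for", var, bound, body); an if is ("if", cond, then_stmts).
def hoist_cond(loop, loop_var):
    kind, var, bound, body = loop
    if len(body) == 1 and body[0][0] == "if":
        _, cond, then = body[0]
        if loop_var not in cond:  # crude stand-in for an invariance check
            return ("if", cond, [("for", var, bound, then)])
    return loop  # not safe (or no single guarding if): leave unchanged
```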

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44764

Reviewed By: mruberry

Differential Revision: D23734463

Pulled By: nickgg

fbshipit-source-id: 51617e837de96b354fb702d0090ac65ddc523d36
2020-09-16 18:41:58 -07:00
Alex Suhan
7b3432caff [TensorExpr] Support boolean in simplifier (#44659)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44659

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConstantFoldCastToBool

Reviewed By: ngimel

Differential Revision: D23714675

Pulled By: asuhan

fbshipit-source-id: 4c18d972b628d5ad55bad58eddd5f6974e043d9c
2020-09-16 15:30:19 -07:00
Nick Gibson
69839ea3f6 [NNC] make inlining immediate (take 3) (#44231)
Summary:
This is a re-up of https://github.com/pytorch/pytorch/issues/43885 with an extra commit which should fix the bugs that caused it to be reverted. Read that PR for general context.

The issue here was that we were still using the side maps `tensor_to_stmt_` and `stmt_to_tensor_` which get invalidated by any transform of the IR (rather than just any transform that isn't computeInline). I added a comment about this but didn't actually address our usages of it.

I've removed these maps and changed the `getLoopBodyFor` and `getLoopStatementsFor` helpers to search the root stmt directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44231

Reviewed By: albanD

Differential Revision: D23689688

Pulled By: nickgg

fbshipit-source-id: 1c6009a880f8c0cebf2300fd06b5cc9322bffbf9
2020-09-15 11:12:24 -07:00
Raghavan Raman
ad7a2eb1c9 Simplify nested Min and Max patterns. (#44142)
Summary:
Improve simplification of nested Min and Max patterns.

Specifically, handles the following pattern simplications:
  * `Max(A, Max(A, Const)) => Max(A, Const)`
  * `Max(Min(A, B), Min(A, C)) => Min(A, Max(B, C))`
  * `Max(Const, Max(A, OtherConst) => Max(A, Max(Const, OtherConst))`
     - This case can have an arbitrarily long chain of Max ops. For example: `Max(5, Max(x, Max(y, Max(z, 8)))) => Max(Max(Max(x, 8), y), z)`

Similarly, for the case of Min as well.
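These rewrites are instances of lattice identities that hold for any total order, so they can be spot-checked numerically. A brute-force check over small integers (illustrative only, not the simplifier itself):

```python
import itertools

# Spot-check the three identities behind the rewrites over small integers.
for a, b, c in itertools.product(range(-2, 3), repeat=3):
    assert max(a, max(a, c)) == max(a, c)                       # absorption
    assert max(min(a, b), min(a, c)) == min(a, max(b, c))       # distributivity
    assert max(5, max(a, 8)) == max(a, max(5, 8))               # constant regrouping
```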

Pull Request resolved: https://github.com/pytorch/pytorch/pull/44142

Reviewed By: albanD

Differential Revision: D23644486

Pulled By: navahgar

fbshipit-source-id: 42bd241e6c2af820566744c8494e5dee172107f4
2020-09-14 13:24:46 -07:00
Bert Maher
6d4a605ce9 Fix bug simplifying if-then-else when it can be removed (#44462)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44462

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D23671157

Pulled By: bertmaher

fbshipit-source-id: b9b92ad0de1a7bd9bc1fcac390b542d885d0ca58
2020-09-13 10:29:28 -07:00
Mikhail Zolotukhin
6474057c76 Revert D23503636: [pytorch][PR] [NNC] make inlining immediate (take 2) and fix bugs
Test Plan: revert-hammer

Differential Revision:
D23503636 (70aecd2a7f)

Original commit changeset: cdbdc902b7a1

fbshipit-source-id: b5164835f874a56213de4bed9ad690164eae9230
2020-09-04 10:58:23 -07:00
Nick Gibson
70aecd2a7f [NNC] make inlining immediate (take 2) and fix bugs (#43885)
Summary:
A rework of `computeInline` which makes it work a bit better, particularly when combined with other transformations. Previously we stored Functions that were inlined and then deferred the actual inlining of the function body until prepareForCodegen was called. This has an issue when transformations are applied to the LoopNest: the function body can be different from what appears in the root_stmt, and result in inlining that a) fails, b) reverses other transformations, or c) does a weird, unpredictable combination of the two.

This PR changes that behaviour so that the inlining occurs in the root stmt immediately, which means it reflects any previous transformations and any future transformations have a true view of the internal IR. It also has the benefit that inspecting the root statement gives an accurate view of it without needing to call prepareForCodegen. I also removed the difference between `computeInline` and `computeInlineWithRand` and we handle calls to `rand()` in all branches.

This is a rework of https://github.com/pytorch/pytorch/issues/38696, with the agreed changes from ZolotukhinM and zheng-xq: we should only inline if the dimensions are trivial (ie. they are vars not exprs).

This PR is mostly tests, and I fixed a bunch of bugs I found along the way. Partial list:
* When inlining an expression involving rand, we would create random vars equal to the dimensionality of the enclosing Tensor not the produced Tensor - meaning we'd use an incorrect value if the inlined tensor was smaller. E.g: `X[i] = rand(); A[i, j] = X[i]` would produce a tensor where `A[0, 0] != A[0, 1]`. This is fixed by inserting the Let binding of the random variable at the correct loop body.
* When inlining we'd replace all calls to `rand()` rather than just those present in the Tensor being inlined.
* `rand()` was treated symbolically by the simplifier and we would aggregate or cancel calls to `rand()`. Have fixed the hasher to hash all calls to `rand()` distinctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43885

Reviewed By: gmagogsfm

Differential Revision: D23503636

Pulled By: nickgg

fbshipit-source-id: cdbdc902b7a14d269911d978a74a1c11eab004fa
2020-09-03 16:49:24 -07:00
Nick Gibson
1390cad2d8 [NNC] Hook up registerizer to Cuda codegen [2/x] (#42878)
Summary:
Insert the registerizer into the Cuda Codegen pass list, to enable scalar replacement and close the gap in simple reduction performance.

First up the good stuff, benchmark before:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.7917          9.7037          6.9386          6.0448
          (100, 100)          5.9338          14.972          7.1139          6.3254
        (100, 10000)          21.453          741.54          145.74          12.555
        (1000, 1000)          8.0678          122.75          22.833          9.0778

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.4502          7.9661          6.1469          5.5587
          (100, 100)          5.7613          13.897           21.49          5.5808
        (100, 10000)          21.702          82.398          75.462          22.793
        (1000, 1000)          22.527             129          176.51          22.517

```

After:
```
          Column sum          Caffe2             NNC          Simple          Better
           (10, 100)          6.0458          9.4966          7.1094           6.056
          (100, 100)          5.9299          9.1482          7.1693           6.593
        (100, 10000)          21.739          121.97          162.63          14.376
        (1000, 1000)          9.2374           29.01          26.883          10.127

             Row sum          Caffe2             NNC          Simple          Better
           (10, 100)          5.9773          8.1792          7.2307          5.8941
          (100, 100)          6.1456          9.3155          24.563          5.8163
        (100, 10000)          25.384          30.212          88.531          27.185
        (1000, 1000)          26.517          32.702          209.31          26.537
```

Speedup about 3-8x depending on the size of the data (increasing with bigger inputs).

The gap between NNC and simple is closed or eliminated - remaining issue appears to be kernel launch overhead. Next up is getting us closer to the _Better_ kernel.

It required a lot of refactoring and bug fixes on the way:
* Refactored flattening of parallelized loops out of the CudaPrinter and into its own stage, so we can transform the graph in the stage between flattening and printing (where registerization occurs).
* Made AtomicAddFuser less pessimistic, it will now recognize that if an Add to a buffer is dependent on all used Block and Thread vars then it has no overlap and does not need to be atomic. This allows registerization to apply to these stores.
* Fixed PrioritizeLoad mutator so that it does not attempt to separate the Store and Load to the same buffer (i.e. reduction case).
* Moved CudaAnalysis earlier in the process, allowing later stages to use the analyzed bufs.
* Fixed a bug in the Registerizer where when adding a default initializer statement it would use the dtype of the underlying var (which is always kHandle) instead of the dtype of the Buf.
* Fixed a bug in the IRMutator where the logic for replacing Allocate statements was inverted, so they were replaced only if they did not change.
* Added simplification of simple Division patterns to the IRSimplifier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42878

Reviewed By: glaringlee

Differential Revision: D23382499

Pulled By: nickgg

fbshipit-source-id: 3640a98fd843723abad9f54e67070d48c96fe949
2020-08-31 10:39:46 -07:00
Alex Suhan
f20a04fa2d [TensorExpr] Simplify conditional select (#43350)
Summary:
Fold conditional select when both sides are constant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/43350

Test Plan: test_tensorexpr --gtest_filter=TensorExprTest.ConditionalSelectFold*

Reviewed By: pbelevich

Differential Revision: D23256602

Pulled By: asuhan

fbshipit-source-id: ec04b1e4ae64f59fa574047f2d7af55a717a5262
2020-08-21 11:15:48 -07:00
Nick Gibson
6fb5ce5569 [NNC] Fix some bugs in Round+Mod simplification (#42934)
Summary:
When working on the Cuda Codegen, I found that running the IRSimplifier before generating code led to test failures. This was due to a bug in Round+Mod simplification (e.g. `(x / y * y) + (x % y) => x`) to do with the order in which the terms appeared. After fixing it and writing a few tests around those cases, I found another bug in simplification of the same pattern and have fixed it (with some more test coverage).
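The identity behind the rewrite, `(x / y) * y + (x % y) == x`, holds for C-style (truncating) integer division in either term order; the bug was that the pattern matcher was sensitive to which term appeared first. A brute-force check using truncating division modeled in Python (illustrative only):

```python
# Model C semantics: / truncates toward zero, % takes the sign of the dividend.
def c_div(x, y):
    q = abs(x) // abs(y)
    return q if (x >= 0) == (y >= 0) else -q

def c_mod(x, y):
    return x - c_div(x, y) * y

# The identity holds regardless of which term comes first.
for x in range(-20, 21):
    for y in (1, 3, 7, -4):
        assert c_div(x, y) * y + c_mod(x, y) == x
        assert c_mod(x, y) + c_div(x, y) * y == x
```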

Pull Request resolved: https://github.com/pytorch/pytorch/pull/42934

Reviewed By: zhangguanheng66

Differential Revision: D23085548

Pulled By: nickgg

fbshipit-source-id: e780967dcaa7a5fda9f6d7d19a6b7e7b4e94374b
2020-08-13 09:47:21 -07:00