Commit graph

227 commits

Author SHA1 Message Date
Jez Ng
e4d294221b [inductor] Log precompilation time (#136395)
This has been useful for diagnosing the long compile time issues I've seen in the Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136395
Approved by: https://github.com/eellison
2024-09-24 01:47:54 +00:00
PyTorch MergeBot
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b47.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
PyTorch MergeBot
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e88.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
Jez Ng
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
Jokeren
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
The previous PR (https://github.com/pytorch/pytorch/pull/135170) forgot to change two other places that also create `constants` and `signature`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
Pushpak Raj Gautam
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
The revert prevented the NaN errors (f639391901): the training job ran for 7767 iterations, while the NaN errors previously showed up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
drisspg
b977abd5de [Inductor] Fix error checking for scaled_mm lowering (#134765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765
Approved by: https://github.com/Skylion007
2024-08-29 20:18:42 +00:00
Nicolas Macchioni
dd69013c7a deprecate search_autotune_cache (#133628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628
Approved by: https://github.com/oulgen
2024-08-16 09:29:39 +00:00
Nicolas Macchioni
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
Move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking utilities.
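
The preference described above can be sketched as a single local entry point with a fallback; the names below are illustrative only and are not the new `torch._inductor.runtime.benchmarking` API:

```python
import time


def benchmark_callable(fn, *args, iters=10, **kwargs):
    """Route all benchmarking through one helper instead of calling Triton directly."""
    try:
        # Triton's benchmark helper (requires Triton and a GPU); returns milliseconds.
        from triton.testing import do_bench
        return do_bench(lambda: fn(*args, **kwargs))
    except ImportError:
        # Fallback: simple wall-clock timing when Triton is unavailable.
        fn(*args, **kwargs)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args, **kwargs)
        return (time.perf_counter() - start) / iters * 1e3  # milliseconds
```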

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
Oguz Ulgen
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
PyTorch MergeBot
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
Oguz Ulgen
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
eellison
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's built-in set has nondeterministic iteration order. We recently ran into an internal failure that, because of this, did not reproduce consistently.

See the repro here: P1453035092.

Now, with these changes, it fails consistently. As a follow-up we could also consider adding a lint rule against uses of set() or set literals.
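
As a rough illustration of the motivation (this is not PyTorch's actual OrderedSet), the snippet below shows how iteration order of a built-in set of strings can change between interpreter runs due to hash randomization, while a dict-backed ordered set always iterates in insertion order:

```python
# Minimal illustrative sketch; PyTorch's real OrderedSet has a richer API.
class OrderedSet:
    """Insertion-ordered set backed by a dict (dicts preserve order since Python 3.7)."""

    def __init__(self, items=()):
        self._data = dict.fromkeys(items)

    def add(self, item):
        self._data[item] = None

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


buffers = ["buf3", "buf1", "buf2"]
print(list(set(buffers)))         # order can differ between runs (PYTHONHASHSEED)
print(list(OrderedSet(buffers)))  # always ['buf3', 'buf1', 'buf2']
```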

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
PyTorch MergeBot
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
eellison
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's built-in set has nondeterministic iteration order. We recently ran into an internal failure that, because of this, did not reproduce consistently.

See the repro here: P1453035092.

Now, with these changes, it fails consistently. As a follow-up we could also consider adding a lint rule against uses of set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
eellison
9598c58618 Add config option to skip autotuning conv (#131839)
Requested internally because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-30 01:57:53 +00:00
PyTorch MergeBot
62b2e7a553 Revert "Add config option to skip autotuning conv (#131839)"
This reverts commit 3d4de8e96d.

Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))
2024-07-29 22:31:51 +00:00
eellison
3d4de8e96d Add config option to skip autotuning conv (#131839)
Requested internally because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-29 16:43:58 +00:00
Alnis Murtovi
d57de73fe0 AutoHeuristic: Add support for kernel choice selection (#131610)
This PR enables AutoHeuristic for kernel choice selection, where the feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:

1. When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
2. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores the results.

I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR.
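
The deferred-feedback flow above can be sketched roughly as follows; the names (`autoheuristic_registry`, `record_autotune_results`) are illustrative placeholders, not the actual PyTorch API:

```python
from typing import Callable, Dict

# Hypothetical registry: AutoHeuristic registers a callback here, and
# select_algorithm invokes it once autotuning results are available.
autoheuristic_registry: Dict[str, Callable[[dict], None]] = {}


class AutoHeuristic:
    def __init__(self, kernel_key: str):
        self.kernel_key = kernel_key
        self.results: dict = {}
        # Feedback cannot be provided at construction time, so register a
        # callback to receive it after autotuning has happened.
        autoheuristic_registry[kernel_key] = self._store_feedback

    def _store_feedback(self, timings: dict) -> None:
        self.results = timings


def record_autotune_results(kernel_key: str, timings: dict) -> None:
    """Stand-in for the hook select_algorithm would call after autotuning."""
    callback = autoheuristic_registry.get(kernel_key)
    if callback is not None:
        callback(timings)


ah = AutoHeuristic("mixed_mm")
record_autotune_results("mixed_mm", {"triton_config_0": 0.12, "aten": 0.15})
print(ah.results)
```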

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
2024-07-26 16:35:55 +00:00
eellison
5772c13f56 Dont wrap negative indexing in scatter reduce (#131503)
Fix for https://github.com/pytorch/pytorch/issues/131321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
eellison
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
Xuehai Pan
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
chilli
d59803fb67 Refactored flexattention kernel (#130904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130904
Approved by: https://github.com/drisspg
ghstack dependencies: #130871
2024-07-19 04:56:32 +00:00
chilli
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
Bert Maher
d443fbc025 [inductor] Cache precompilation functions based on configs (#130350)
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs. Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs. Add the configs' hashes to the key to avoid this problem.
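
A minimal sketch of the fix, with hypothetical helper names (this is not the actual Inductor code): fold a hash of each choice's config into the cache key so that two different sets of choices can no longer collide on the same pool of precompilation futures.

```python
import hashlib


def precompile_cache_key(base_key: str, choices) -> str:
    # Hash each candidate choice's config (repr stands in for however the real
    # code serializes a config) and mix the hashes into the key.
    config_hashes = sorted(
        hashlib.sha256(repr(choice).encode()).hexdigest() for choice in choices
    )
    return hashlib.sha256((base_key + "".join(config_hashes)).encode()).hexdigest()


# Two different choice sets with the same base key now get distinct keys.
print(precompile_cache_key("mm_128x128", ["triton_cfg_a", "triton_cfg_b"]))
print(precompile_cache_key("mm_128x128", ["cutlass_cfg_a"]))
```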

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
2024-07-12 14:21:49 +00:00
PyTorch MergeBot
da030e7add Revert "[Inductor] FlexAttention supports partial masking (#130415)"
This reverts commit 207564bab1.

Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant  ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))
2024-07-12 13:20:18 +00:00
Yanbo Liang
207564bab1 [Inductor] FlexAttention supports partial masking (#130415)
This is the new version of #130235

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415
Approved by: https://github.com/Chillee
2024-07-12 07:19:28 +00:00
Xuehai Pan
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
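
A quick, runnable demonstration of both points (timings will vary by machine and Python version):

```python
import collections
import timeit

# The literal avoids a global name lookup and a call, so it is faster.
print(timeit.timeit("{}", number=1_000_000))
print(timeit.timeit("dict()", number=1_000_000))


def build_empty():
    dict = collections.OrderedDict  # shadows the builtin in this scope
    return dict()                   # silently builds an OrderedDict now


print(type(build_empty()))  # <class 'collections.OrderedDict'>
```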

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
chilli
cd683212a2 Fix indexing twice with score_mod (#130224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106
2024-07-08 18:15:35 +00:00
chilli
bf609630ae Fix a bunch of stride issues with FlexAttention (#130160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160
Approved by: https://github.com/yanboliang
2024-07-06 03:58:14 +00:00
Max Podkorytov
79959d707c [Inductor][ROCm] Composable Kernel backend for Inductor (#125453)
This PR adds an alternative backend for Inductor, contributing Composable Kernel Universal GEMM instances to the autotune instance selection.

The implementation is heavily influenced by the series of PRs that adds the CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are:
 (1) customizing the compiler for the ROCm platform, and
 (2) customizing template code generation for Composable Kernel Universal GEMM instances.

We provide config tuning knobs for balancing the compilation time of the instance sources against finding the best instance.

### Testing
Install the CK library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-25 20:54:14 +00:00
chilli
237c4e6163 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 15:58:53 +00:00
Bin Bao
62e5d045c0 [AOTI] Auto-tune Triton kernels in a separate block (#129057)
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.

There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effects coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing Triton kernel autotuning: because kernel autotuning depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.cond or torch.while_loop.

This PR is the first step towards solving these problems by moving Triton kernel autotuning to compile time and using random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor generates separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen one-pass will come later, once this solution is proven stable and generates kernels as performant as before.
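
As a toy, self-contained sketch of the core idea (not the AOTI implementation): benchmark candidate kernels on randomly generated inputs of the right shapes, so no real model execution is needed to pick a winner.

```python
import time

import torch


def autotune_with_random_inputs(candidates, arg_shapes, device="cpu", iters=20):
    """Pick the fastest candidate using random inputs instead of real model inputs."""
    args = [torch.randn(shape, device=device) for shape in arg_shapes]
    timings = {}
    for name, fn in candidates.items():
        fn(*args)  # warm-up
        start = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        timings[name] = (time.perf_counter() - start) / iters
    best = min(timings, key=timings.get)
    return best, timings


# Example: choose between two equivalent matmul formulations.
best, timings = autotune_with_random_inputs(
    {"mm": torch.mm, "einsum": lambda a, b: torch.einsum("ik,kj->ij", a, b)},
    [(256, 256), (256, 256)],
)
print(best, timings)
```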

Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-21 14:34:13 +00:00
PyTorch MergeBot
f73b451e78 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit ff89ebc50a.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the test_torchinductor_opinfo test starts to fail after this commit ff89ebc50a, I am reverting to see if it helps trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2182042422))
2024-06-21 05:46:46 +00:00
chilli
ff89ebc50a Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 03:01:16 +00:00
PyTorch MergeBot
832fc35211 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit 6d2b3c90f1.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing a flexattention test to fail on ROCm. Can you please fix that test before remerging this in? See 6d2b3c90f1 for details ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2181133070))
2024-06-20 16:51:41 +00:00
chilli
6d2b3c90f1 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-20 05:15:48 +00:00
Bin Bao
ba92f5277f [inductor][refactor] Unify the use of generate_kernel_call (#128467)
Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and simplifies later changes in this PR stack.

Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467
Approved by: https://github.com/shunting314
2024-06-19 07:47:25 +00:00
chilli
eb9f4da11e Modified template indexing to broadcast indices to out instead of mask and some other flexattention micro-opts (#128938)
For headdim=64 and headdim=128

Old:
<img width="656" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2c5d1613-96dc-4300-8dc0-dccaef59e73c">

New:
<img width="644" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/730004a8-6d5f-46a5-82a0-2594feb5e192">

Note: this does regress headdim=256. We could recover it by special-casing `headdim=256`, but we can do that later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128938
Approved by: https://github.com/drisspg
2024-06-19 03:41:22 +00:00
Jason Ansel
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
Edward Z. Yang
3964a3ec73 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc. in sympy; instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what you get by first coercing the integers to floats and then doing true division (see the illustration after this list).
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
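
The snippet below is a standalone, plain-Python illustration (not PyTorch code) of why IntTrueDiv is worth distinguishing: rounding large integer operands to float before dividing can change the answer relative to correctly rounded true division on the exact integers.

```python
a, b = 2**53 + 1, 3              # a is not exactly representable as a float

print(a / b)                     # 3002399751580331.0 -- the exact ints are divided
print(float(a) / float(b))       # 3002399751580330.5 -- a was rounded to 2**53 first

# Coercing to float can also overflow where exact integer division would not:
# float(10**400) raises OverflowError, yet the quotient below is a small float.
print((10**400) / (10**399))     # 10.0
```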

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency here to fix tests.
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/

It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-09 06:20:25 +00:00
Aaron Orenstein
afe15d2d2f Flip default value for mypy disallow_untyped_defs [3/11] (#127840)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127840
Approved by: https://github.com/oulgen
2024-06-08 18:28:01 +00:00
PyTorch MergeBot
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
Edward Z. Yang
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc. in sympy; instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what you get by first coercing the integers to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency here to fix tests.
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
Edward Z. Yang
d3ad84c38f Use pexpr, not texpr in Triton launch codegen (#128038)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038
Approved by: https://github.com/Skylion007
2024-06-06 00:45:59 +00:00
PyTorch MergeBot
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
Edward Z. Yang
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc. in sympy; instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what you get by first coercing the integers to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency here to fix tests.
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
Yanbo Liang
e793ae220f [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-04 04:27:24 +00:00
Yanbo Liang
69f5b66132 [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

There are a few things to do as follow-ups:
* Optimize default block sizes on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check that there is no `nan` value in any grad. I think we should make some changes to `test_padded_dense_causal` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-06-04 04:22:41 +00:00
PyTorch MergeBot
2fc907971a Revert "[Inductor] FlexAttention backward kernel optimization (#127208)"
This reverts commit f7171313ab.

Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))
2024-06-03 18:13:27 +00:00