Commit graph

715 commits

Author SHA1 Message Date
Tom Ritchford
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
Tom Ritchford
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
Manuel Candales
d6115439be [MPS] Add SDPA implentation (#131362)
This work is based off @malfet's #119200

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362
Approved by: https://github.com/kimishpatel
2024-07-25 03:24:37 +00:00
Oguz Ulgen
4ca8705035 Add mypy typing to fx node (#131434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434
Approved by: https://github.com/zou3519
2024-07-23 17:00:31 +00:00
Tom Ritchford
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
Tom Ritchford
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
Isuru Fernando
bb4251213b Add decomposition for channel_shuffle (#118775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775
Approved by: https://github.com/peterbell10
2024-07-20 01:24:41 +00:00
Isuru Fernando
43a6d20883 Add decomposition for reflection_pad{1,2,3}d_backward (#130299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130299
Approved by: https://github.com/lezcano
ghstack dependencies: #130130
2024-07-17 21:56:00 +00:00
Shangdi Yu
ea4b80e6d6 [FX][export] strict DCE pass, check schema for node impurity (#130552)
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py ` caused by dead code elimination removing node with side effects.

For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.

 A call_function node is impure if it has at least one mutable argument.

Fixed the tests below:

test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util

Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
2024-07-12 15:43:27 +00:00
PyTorch MergeBot
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b8.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
PyTorch MergeBot
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb21098.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
Tom Ritchford
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
Tom Ritchford
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
Yukio Siraichi
a79bb8db91 Make _embedding_bag_backward explicitly dispatch to CPU and CUDA. (#129691)
This PR modifies `_embedding_bag_backward` item inside _native_functions.yaml_, so that it
dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`.

*Context:* PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not
allow third party backends (e.g. XLA) to modify its implementation, since this dispatch
key has higher priority. When calling `_embedding_bag_backward` operation using XLA, a
dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors.

*Problem:* `_embedding_bag_backward` has a `sparse` parameter that controls whether the
operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does
not support sparse tensors. In order to fallback that execution to dense, i.e. change the
flag at runtime, we need to be able to modify its implementation.

*Solution:* we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA,
which allowed us to introduce our own kernel for it.

Additionally, this PR refactored the representation of its mode from constant integers
into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode`
and `int != EmbeddingBagMode`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691
Approved by: https://github.com/lezcano
2024-07-03 21:54:49 +00:00
PyTorch MergeBot
fa6c0fe3e4 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit 9450e198aa.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2197790226))
2024-06-29 00:16:47 +00:00
FEI
59e4e92556 sdp::SDPBackend::flash_attention support PrivateUse1 (#126392)
Fixes https://github.com/pytorch/pytorch/issues/124271

cc  @cpuhrsch @drisspg @albanD @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392
Approved by: https://github.com/drisspg
2024-06-28 17:48:40 +00:00
Isuru Fernando
c12a4f2e65 Add decomposition for slice_scatter (#123744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744
Approved by: https://github.com/peterbell10
2024-06-28 17:02:10 +00:00
Peter Bell
3fc279633b [ATen] Make argsort.stable CompositeImplicitAutograd (#129529)
It literally just calls `at::sort` and returns the indices, so is composite compliant.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129529
Approved by: https://github.com/lezcano
2024-06-27 23:49:16 +00:00
Antoni Viros
9450e198aa Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-06-27 03:41:28 +00:00
Joel Schlosser
31d5753247 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
2024-06-20 23:15:53 +00:00
PyTorch MergeBot
b0d2fe6299 Revert "Short-term fix to preserve NJT metadata cache in torch.compile (#122836)"
This reverts commit 2a41fc0390.

Reverted https://github.com/pytorch/pytorch/pull/122836 on behalf of https://github.com/jbschlosser due to internal test failures with DEBUG=1 asserts ([comment](https://github.com/pytorch/pytorch/pull/122836#issuecomment-2177298245))
2024-06-19 00:28:53 +00:00
Joel Schlosser
2a41fc0390 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
2024-06-17 15:25:09 +00:00
chilli
c486e2ab64 Add coloring to fx graph print out (#128476)
Note: Won't land immediately, at least I'll need to add a color option to the field. But curious if any tests fail.

Old:
<img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6">

New:
<img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476
Approved by: https://github.com/ezyang
2024-06-13 23:39:04 +00:00
Tom Ritchford
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
Tom Ritchford
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
PyTorch MergeBot
3b73f5de3a Revert "Add OpInfo entry for alias_copy (#127232) (#128142)"
This reverts commit 04da6aeb61.

Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))
2024-06-10 16:17:16 +00:00
Tom Ritchford
04da6aeb61 Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-10 15:01:53 +00:00
PyTorch MergeBot
c58d3af3b4 Revert "Add OpInfo entry for alias_copy (#127232)"
This reverts commit 457df212e1.

Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](457df212e1) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))
2024-06-06 15:44:47 +00:00
Tom Ritchford
457df212e1 Add OpInfo entry for alias_copy (#127232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232
Approved by: https://github.com/lezcano
2024-06-06 07:46:26 +00:00
Joel Schlosser
b42cfcabc4 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-06-03 23:41:54 +00:00
Jane Xu
601c5e085d Add _foreach_max (#127187)
This PR adds _foreach_max support, the second reduction foreach op we have :D

I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.

Caveats!
- We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
PyTorch MergeBot
c34f8c7f91 Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit 5e69e11d09.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to sorry the Dr CI fix hasn't been merged yet and its still failing 5e69e11d09 https://github.com/pytorch/pytorch/actions/runs/9228887299/job/25393895252 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2130305958))
2024-05-24 20:26:07 +00:00
Joel Schlosser
5e69e11d09 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 19:16:29 +00:00
PyTorch MergeBot
99a11efc8a Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit e2f081837f.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to I think dr ci is wrong and the windows build failure is real e2f081837f https://github.com/pytorch/pytorch/actions/runs/9216826622/job/25357819877 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2128388126))
2024-05-24 02:37:46 +00:00
Joel Schlosser
e2f081837f Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 00:42:59 +00:00
PyTorch MergeBot
315389bfed Revert "Remove deprecated _aminmax operator (#125995)"
This reverts commit 0116ffae7f.

Reverted https://github.com/pytorch/pytorch/pull/125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](https://github.com/pytorch/pytorch/pull/125995#issuecomment-2113769497))
2024-05-16 01:45:37 +00:00
haozhe.zhu
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
Andres Lugo-Reyes
38b8b614a2 [ROCm] Implement forward AD for miopen_batch_norm (#125069)
Implements forward automatic differentiation support for miopen_batch_norm as well as unskips the associated unit tests. Also fixes a class of functorch related unit tests that fail due to failing a contiguous tensor assertion in BatchNorm_miopen.cpp. Solution was to just limit tensors to miopen_batch_norm that have at least 3 dimensions. The exact restriction already existed in the cudnn path and is why the tests in question only failed on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
2024-05-14 19:09:50 +00:00
PyTorch MergeBot
bd3cbdba2f Revert "[optim] add fused_adagrad support for CPU device (#124905)"
This reverts commit 1c3fe84033.

Reverted https://github.com/pytorch/pytorch/pull/124905 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing distributed multigpu test in trunk 1c3fe84033 ([comment](https://github.com/pytorch/pytorch/pull/124905#issuecomment-2108777063))
2024-05-13 20:53:22 +00:00
haozhe.zhu
1c3fe84033 [optim] add fused_adagrad support for CPU device (#124905)
Support fused_sgd_kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-13 01:16:20 +00:00
cyy
0116ffae7f Remove deprecated _aminmax operator (#125995)
It has been deprecated for a long time.

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125995
Approved by: https://github.com/ezyang
2024-05-12 17:50:17 +00:00
Aaron Orenstein
e71207b729 Fix infinite recursion in API BC test (#125706)
```
python test/test_fx.py -k test_public_api_surface
```
was failing with a complaint about infinite recursion. Fixed that and then marked the two API changes from #123681 as private (for `get_example_value`) and backward compatible (for `insert_deferred_runtime_asserts`).

Fixes #104012

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125706
Approved by: https://github.com/BoyuanFeng
2024-05-08 23:07:16 +00:00
chilli
b356a0de86 Add support for multiple flexattention calls in a single compile (#125516)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125516
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-05-07 21:37:37 +00:00
Xuehai Pan
30c9fd96f6 [FX] Add missing forbidden mutation methods in immutable collections (#125468)
Add `list.sort`, `list.reverse`, `dict.__ior__`, and `dict.setdefault`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125468
Approved by: https://github.com/Skylion007
2024-05-05 19:30:22 +00:00
Isuru Fernando
4d5f8070c4 add a decomposition for select_scatter (#124426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124426
Approved by: https://github.com/peterbell10
2024-05-01 03:23:18 +00:00
eqy
a866bfff45 [cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)
#113713
currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs

Will also collect benchmark data,

CC @drisspg

Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-27 04:15:49 +00:00
PyTorch MergeBot
7a6813b7b3 Revert "[cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)"
This reverts commit 64af899fdf.

Reverted https://github.com/pytorch/pytorch/pull/122510 on behalf of https://github.com/jeanschmidt due to Breaking amd gpu builds ([comment](https://github.com/pytorch/pytorch/pull/122510#issuecomment-2076743868))
2024-04-25 09:22:37 +00:00
eqy
64af899fdf [cuDNN] cuDNN SDPA (Flash Attention) Backward (#122510)
#113713
currently passing trivial smoke tests but I just totally pattern-matched bits and pieces of the autograd defs

Will also collect benchmark data,

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122510
Approved by: https://github.com/drisspg
2024-04-24 01:00:34 +00:00
Isuru Fernando
edcd968b51 Add out wrappers to some decompositions (#115437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115437
Approved by: https://github.com/lezcano
2024-04-23 06:26:11 +00:00
Jesse Cai
c9db59e9e4 [sparse] Add fast semi-structured spasification kernels (#122350)
This PR adds in fast semi-structured sparsification kernels to PyTorch.

These kernels allow for accelerated semi-structured sparsification
kernels in PyTorch.

The kernels have been added as aten native functions

In particular, three new functions have been added:

* `torch._sparse_semi_structured_tile`

This function will return the packed representation and metadata for
both X and X', as well as the thread masks. Note that this applies 2:4
sparsity in a 4x4 tile instead of a 1x4 strip as usual.

* `torch._sparse_semi_structured_apply`

This function takes in an input tensor and thread masks from the above
function and returns a packed representation and metadata from applying
thread masks to the input tensor.

* `torch._sparse_semi_structured_apply_dense`

This function does the same thing as above but instead of returning the
tensor in the sparse representation it returns it in the dense
representation

The subclasses have also been updated to add a new
`prune_dense_static_sort`
classmethod to create sparse tensors with this format. I've added some
additional documentatino on how to calculate the compressed tensors
needed to create a SparseSemiStructuredTensor oneself.

To this end, there are two new helper functions added:
`sparse_semi_structured_tile`
`compute_compressed_swizzled_bitmask`

Differential Revision: [D56190801](https://our.internmc.facebook.com/intern/diff/D56190801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122350
Approved by: https://github.com/cpuhrsch
2024-04-19 13:31:58 +00:00