81624 commits

Author SHA1 Message Date
atalman
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936)" (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00
soulitzer
e41a0b33ec Allow Fakified subclass to have different device for inner and outer tensor (#141839)
Previously if a wrapper tensor subclass is fakified, the inner tensors would end up having the same device as the outer tensor. This PR makes it so that inner and outer tensors can have different devices.

See OffloadTensor PR https://github.com/pytorch/pytorch/pull/141840/files#diff-3bc0cf540b694f4ec0a3749f78b047456657a53a5657e495ffb68e5970c5fdaaR1955 for an application. A simpler test has been added in this PR.

This is technically bc-breaking because now the callback passed to MetaConverter needs to accept an extra argument, but no one external should be using this anyway?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141839
Approved by: https://github.com/bdhirsh
ghstack dependencies: #141166
2024-12-03 00:09:41 +00:00
Chris Sidebottom
9830e7b1e4 Update OpenBLAS to 0.3.28 (#137263)
This includes a number of performance improvements, such as threading optimisations and forwarding GEMM calls to GEMV for calls where N=1 or M=1.
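
As an illustration of why forwarding GEMM to GEMV is valid when N=1, here is a minimal pure-Python sketch. It is not OpenBLAS code; `gemm` and `gemv` are hypothetical helper names. It shows that multiplying by a single-column matrix gives the same values as a matrix-vector product.

```python
def gemm(a, b):
    """Naive matrix-matrix product of a (M x K) and b (K x N)."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def gemv(a, x):
    """Naive matrix-vector product of a (M x K) and x (length K)."""
    return [sum(row[p] * x[p] for p in range(len(x))) for row in a]


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # M=3, K=2
x = [10.0, 20.0]                          # vector of length K
b = [[v] for v in x]                      # the same data as a K x 1 matrix (N=1)

via_gemm = [row[0] for row in gemm(a, b)]  # flatten the single output column
via_gemv = gemv(a, x)
```

Both paths compute the same dot products, so a library is free to pick the cheaper routine for this shape.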

See: https://github.com/OpenMathLib/OpenBLAS/releases

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137263
Approved by: https://github.com/malfet
2024-12-03 00:05:34 +00:00
Nikita Shulga
9f9105a67b [MPS] Write/Invoke Metal shaders from C++ (#141547)
By introducing `DynamicMetalShaderLibrary` and `MetalShaderFunction`.
Add unit tests that also serve as an example of how the API works.

Using this primitive, one can compile and dispatch any 1D or 2D shader over an MPS tensor using the following pattern:
```cpp
auto x = torch::empty({8, 16}, at::device(at::kMPS));
DynamicMetalShaderLibrary lib(R"MTL(
  kernel void full(device float* t, constant ulong2& strides, uint2 idx [[thread_position_in_grid]]) {
    t[idx.x*strides.x + idx.y*strides.y] = idx.x + 33.0 * idx.y;
  }
)MTL");
auto func = lib.getKernelFunction("full");
func->runCommandBlock([&] {
  func->startEncoding();
  func->setArg(0, x);
  func->setArg(1, x.strides());
  func->dispatch({8, 16});
});
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141547
Approved by: https://github.com/Skylion007
2024-12-02 23:57:59 +00:00
iupaikov-amd
5c2584a14c [ROCm] Enable inductor GEMM lowering for gfx11 (#141687)
This check doesn't make sense for some AMD GPUs: they have enough CUs and perform adequately, but on RDNA `multi_processor_count` returns the number of WGPs rather than CUs. Many tests fail on modern architectures because this check makes them default to not using the GEMM backend.
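
A rough sketch of the mismatch described above. On RDNA a work-group processor (WGP) groups two compute units (CUs), so a threshold written in terms of CUs can reject a capable GPU when the reported count is in WGPs. The threshold value, the function names, and the device numbers are illustrative assumptions, not the actual inductor check or the change made in this PR.

```python
MIN_UNITS_FOR_GEMM = 68  # hypothetical threshold, for illustration only
CUS_PER_WGP = 2          # an RDNA WGP groups two CUs


def passes_naive_check(reported_count: int) -> bool:
    """Compare the reported count directly against the threshold."""
    return reported_count >= MIN_UNITS_FOR_GEMM


def passes_cu_aware_check(reported_count: int, reports_wgps: bool) -> bool:
    """Convert a WGP count to CUs before comparing against the threshold."""
    cu_count = reported_count * CUS_PER_WGP if reports_wgps else reported_count
    return cu_count >= MIN_UNITS_FOR_GEMM


reported = 48  # hypothetical RDNA device reporting 48 WGPs, i.e. 96 CUs
naive = passes_naive_check(reported)
cu_aware = passes_cu_aware_check(reported, reports_wgps=True)
```

Here `naive` is `False` while `cu_aware` is `True`, which is the kind of false negative the commit message describes.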

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141687
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
2024-12-02 22:13:34 +00:00
chunhuanMeng
1f3d8896bc Fix mismatched tensor metadata between FakeTensor and Intel XPU concrete tensor when running F.logsigmoid (#141333)
Fixes https://github.com/pytorch/pytorch/issues/141332
`F.logsigmoid` returns two outputs: `output` and `buffer`.
On the CPU path, `F.logsigmoid` uses `buffer` to store intermediate values that are reused when computing gradients, so it returns a `buffer` tensor with nonzero size. On the CUDA and XPU paths the buffer is unused, so the XPU implementation of `F.logsigmoid` returns a `buffer` tensor of size zero, just like CUDA. The root cause of the issue is that the code in `decompositions.py` (ref: https://github.com/pytorch/pytorch/blob/main/torch/_decomp/decompositions.py#L2803) only handles the CUDA case: when a fake tensor whose device is XPU reaches this code, it takes the CPU path and returns a `buffer` with nonzero size, which conflicts with the Intel XPU concrete tensor implementation. This PR therefore adds a condition to handle the XPU case, so that the two returned buffer sizes match.
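
The device-dependent buffer can be sketched in plain Python. This is a simplified stand-in operating on lists of floats, not the code in `decompositions.py`; the function name and the string device tags are assumptions made for illustration.

```python
import math


def log_sigmoid_forward(values, device_type):
    """Return (output, buffer) for a flat list of floats."""
    # z = exp(-|x|) is the intermediate that the CPU path keeps for backward.
    z = [math.exp(-abs(v)) for v in values]
    output = [min(0.0, v) - math.log1p(zi) for v, zi in zip(values, z)]
    # Devices whose kernels do not need the intermediate get an empty buffer.
    if device_type in ("cuda", "xpu"):
        buffer = []
    else:
        buffer = z
    return output, buffer


out_cpu, buf_cpu = log_sigmoid_forward([0.0, 1.0], "cpu")
out_xpu, buf_xpu = log_sigmoid_forward([0.0, 1.0], "xpu")
```

The outputs agree across devices; only the size of the second return value differs, which is what the fake tensor metadata has to reproduce.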

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141333
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/ezyang
2024-12-02 22:09:20 +00:00
Avik Chaudhuri
74eb92ed6e fix deep copy of empty graph (#141660)
Differential Revision: D66532131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141660
Approved by: https://github.com/ezyang
2024-12-02 22:03:13 +00:00
Bin Bao
41e59754b4 [CI] Remove inductor-perf-test-nightly-a10g.yml (#141895)
Summary: Deprecate the A10g nightly perf run. The workflow was introduced as an experiment and doesn't seem to be used by developers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141895
Approved by: https://github.com/huydhn
2024-12-02 21:55:20 +00:00
cyy
55250b324d [1/N] Apply py39 ruff fixes (#138578)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138578
Approved by: https://github.com/Skylion007
2024-12-02 21:46:18 +00:00
PyTorch MergeBot
b47bdb06d8 Revert "[inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)"
This reverts commit 942a2438e2.

Reverted https://github.com/pytorch/pytorch/pull/141334 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/141334#issuecomment-2512891840))
2024-12-02 21:29:02 +00:00
PyTorch MergeBot
6b05e31042 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 61534391ba.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a lot of failures shows up after this lands ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2512890426))
2024-12-02 21:26:13 +00:00
Colin L. Rice
64d44a39a1 remote_cache: Add a waitcounter for gets and sets (#141307)
This adds a basic waitcounter to help show if we're spending a lot of
time doing gets and sets to remote caches

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141307
Approved by: https://github.com/masnesral
2024-12-02 20:48:47 +00:00
PyTorch MergeBot
daa77f3d9f Revert "[BE]: Update mypy to 1.13.0 (#140808)"
This reverts commit 00134d68af.

Reverted https://github.com/pytorch/pytorch/pull/140808 on behalf of https://github.com/huydhn due to This is failing a distributed test in trunk, target determination missed this test and did not run it on PR ([comment](https://github.com/pytorch/pytorch/pull/140808#issuecomment-2512788426))
2024-12-02 20:47:43 +00:00
Benjamin Glass
54adbbf6b8 cpp_wrapper: Add support for MemoryFormat arguments (#141367)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141367
Approved by: https://github.com/desertfire
2024-12-02 20:40:24 +00:00
Edward Z. Yang
30574380a3 [REFACTOR] Factor _fx_graph_cache_key and _time_taken_ns to common base class (#141878)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141878
Approved by: https://github.com/jamesjwu
ghstack dependencies: #141877
2024-12-02 20:07:12 +00:00
Edward Z. Yang
61534391ba [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-12-02 19:48:05 +00:00
Huy Do
fe68f61c59 Migrate micro benchmark results to benchmark database schema v3 (#141745)
Similar to https://github.com/pytorch/pytorch/pull/141087, this uploads the micro benchmark results to benchmark database with its new schema v3. The data can then be queried.

~I'm testing with `inductor-micro-benchmark-x86` which should be sufficient because `inductor-micro-benchmark` is broken atm.  The CSV output stays for now until the dashboard is migrated to schema v3.~ https://github.com/pytorch/pytorch/issues/141747 has been resolved, so inductor-micro-benchmark should work now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141745
Approved by: https://github.com/yanboliang
2024-12-02 19:45:51 +00:00
cyy
ab5467897a Fix NOLINTNEXTLINE (#141794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-12-02 19:22:00 +00:00
soulitzer
161a2340ee Switch to using Python nested int (#141166)
Doesn't seem to noticeably slow down eager - TestNestedTensorSubclass tests with and without the PR finished in similar amounts of time (around 57s, 58s)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141166
Approved by: https://github.com/ezyang
2024-12-02 19:17:30 +00:00
Ryan Guo
2d708752f0 [dynamo] Remove AutoDerefLocalSource and simplify cell handling (#141629)
This patch
1. removes `AutoDerefLocalSource` in favor of `LocalSource`, thereby
   removing its special handling in guards.
2. introduces a `LocalCellSource` for cells from the root frame, with
   only `reconstruct` implemented, to programmatically enforce that these
   cells should never be used by other components like guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141629
Approved by: https://github.com/jansel
ghstack dependencies: #141628
2024-12-02 19:09:30 +00:00
Ryan Guo
e14d8c980f [dynamo][NFC] Rename NewCellVariable to CellVariable (#141628)
It was named `NewCellVariable` because we originally used it to
represent cells created by the code Dynamo is tracing through. However, now we
use it to represent pre-existing cells as well, so this patch renames it
to avoid confusion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141628
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-12-02 19:09:30 +00:00
Colin L. Rice
0989871ac9 pytorch/feature: Record if parallel compile is enabled (#141074)
This gets a bit messy, but this appears to be the best spot to make a
true / false decision.

Note that since we're looking at whether or not it's used, if the pool
doesn't warm up within the time it takes for a compile, we will mark the
feature use as false.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141074
Approved by: https://github.com/masnesral
ghstack dependencies: #141059
2024-12-02 19:09:11 +00:00
Aaron Gokaslan
00134d68af [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. This should hopefully reduce linting time. It supports orjson cache serialization, which should improve mypy cache performance if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-02 18:47:54 +00:00
PyTorch MergeBot
9012e7a62f Revert "[dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)"
This reverts commit 07850bb2c1.

Reverted https://github.com/pytorch/pytorch/pull/137397 on behalf of https://github.com/atalman due to Failing internal test ([comment](https://github.com/pytorch/pytorch/pull/137397#issuecomment-2511934283))
2024-12-02 16:05:14 +00:00
PyTorch MergeBot
eb7deb2db5 Revert "Fix NOLINTNEXTLINE (#141794)"
This reverts commit 7dd9b5fc43.

Reverted https://github.com/pytorch/pytorch/pull/141794 on behalf of https://github.com/atalman due to [GH job link](https://github.com/pytorch/pytorch/actions/runs/12087979418/job/33711943084) [HUD commit link](7dd9b5fc43) ([comment](https://github.com/pytorch/pytorch/pull/141794#issuecomment-2511789484))
2024-12-02 15:07:50 +00:00
PyTorch MergeBot
a34a56f69f Revert "Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)"
This reverts commit 795f28ac55.

Reverted https://github.com/pytorch/pytorch/pull/141625 on behalf of https://github.com/albanD due to Broken main ([comment](https://github.com/pytorch/pytorch/pull/141625#issuecomment-2511639687))
2024-12-02 14:10:38 +00:00
PyTorch MergeBot
ec96597e47 Revert "ILP for auto FSDP wrapping (#140298)"
This reverts commit d4cdc09881.

Reverted https://github.com/pytorch/pytorch/pull/140298 on behalf of https://github.com/xuanzhang816 due to for other PR ([comment](https://github.com/pytorch/pytorch/pull/140298#issuecomment-2511638743))
2024-12-02 14:08:04 +00:00
Valentine233
942a2438e2 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.
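
The difference between checking a total and checking a pattern-specific counter can be sketched as follows. `record_match` and the `pattern_matcher_*` keys are illustrative assumptions; only the two `mkldnn_*` counter names come from the description above.

```python
from collections import Counter

counters = Counter()


def record_match(pattern_name, nodes):
    """Bump both the general counters and the pattern-specific ones."""
    counters["pattern_matcher_count"] += 1
    counters["pattern_matcher_nodes"] += nodes
    counters[f"{pattern_name}_matcher_count"] += 1
    counters[f"{pattern_name}_matcher_nodes"] += nodes


record_match("mkldnn_unary_fusion", 2)
record_match("mkldnn_conv_weight_pack", 1)


def matcher_check_fn():
    """Assert on the pattern under test, not on the overall total."""
    assert counters["mkldnn_unary_fusion_matcher_nodes"] == 2
    assert counters["mkldnn_conv_weight_pack_matcher_count"] == 1


matcher_check_fn()
```

A check on the total alone would change whenever an unrelated pattern starts or stops matching, while the specific counters stay stable.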

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-12-02 08:42:10 +00:00
leslie-fang-intel
96d2a511ce [Inductor][CPP] Fix issue in CPP GEMM Template Prune Tensor (#141798)
**Summary**
When addressing [issue #134998](https://github.com/pytorch/pytorch/issues/134998), we verify whether any node in the current graph shares the same storage as the node we intend to prune. In the implementation, we assumed that when creating the `GraphLowering` in the post-grad phase, there would be no `submodules`, and all `get_attr` nodes would correspond to a `torch.Tensor`. However, this assumption proves incorrect when `FlexAttention` is enabled. In this scenario, `submodules` are present as `get_attr` nodes in the post-grad phase. For example:

```
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]     class sdpa_score30(torch.nn.Module):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]         def forward(self, arg0_1: "bf16[][]cpu", arg1_1: "i32[][]cpu", arg2_1: "i32[][]cpu", arg3_1: "i32[][]cpu", arg4_1: "i32[][]cpu"):
V1128 23:23:47.071000 1965794 torch/_inductor/compile_fx.py:875] [0/1] [__post_grad_graphs]             return arg0_1

V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_score30 = self.sdpa_score30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         sdpa_mask30 = self.sdpa_mask30
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         flex_attention_30 = torch.ops.higher_order.flex_attention(add_276, index_put_60, index_put_61, sdpa_score30, (_frozen_param293, _frozen_param295, _frozen_param296, _frozen_param297, _frozen_param298, _frozen_param299, _frozen_param300, _frozen_param301, 64, 64, sdpa_mask30), 0.08838834764831843, {'SKIP_MASK_SCORE': True, 'PRESCALE_QK': False, 'ROWS_GUARANTEED_SAFE': False, 'BLOCKS_ARE_CONTIGUOUS': False, 'OUTPUT_LOGSUMEXP': False}, (), (_frozen_param294,));  add_276 = sdpa_score30 = sdpa_mask30 = None
V1128 23:23:45.482000 1965794 torch/_inductor/freezing.py:118] [0/1]         getitem_60: "bf16[1, 32, 1, 128]" = flex_attention_30[0];  flex_attention_30 = None
```
We added an extra check in the implementation so that a `get_attr` node is compared only when it corresponds to a `torch.Tensor`. It is difficult to reproduce this issue using pure higher-order operators. Adding a unit test after https://github.com/pytorch/pytorch/pull/141453 lands would be more straightforward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141798
Approved by: https://github.com/jgong5
2024-12-02 07:38:57 +00:00
PyTorch MergeBot
90f4d60672 Revert "export AOTI_TORCH_EXPORT on Windows. (#140030)"
This reverts commit daed864f7b.

Reverted https://github.com/pytorch/pytorch/pull/140030 on behalf of https://github.com/xuhancn due to need to fix on XPU. ([comment](https://github.com/pytorch/pytorch/pull/140030#issuecomment-2510737212))
2024-12-02 07:10:41 +00:00
cyy
8cada5cbe5 Use std::apply (#141834)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141834
Approved by: https://github.com/Skylion007
2024-12-02 05:49:10 +00:00
Adnan Akhundov
f16e08042c [user triton] Fix grid codegen for configs with empty kwargs (#141824)
Fixes #141823 by adding special handling of the codegen `if <config kwargs>: return <grid>` for the cases when there are no kwargs in the config.
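
A minimal sketch of the failure mode, assuming a code generator that emits one `if <config kwargs>: return <grid>` line per config. `emit_grid_function` is a hypothetical helper written for illustration, not the inductor implementation. With no kwargs there is nothing to put in the condition, so the sketch emits an unconditional `return` instead of an invalid `if :`.

```python
def emit_grid_function(name, configs):
    """Generate source for a function that picks a grid from `meta`.

    `configs` is a list of (kwargs, grid) pairs.
    """
    lines = [f"def {name}(meta):"]
    for kwargs, grid in configs:
        if kwargs:
            guard = " and ".join(
                f"meta[{key!r}] == {value!r}" for key, value in kwargs.items()
            )
            lines.append(f"    if {guard}: return {grid!r}")
        else:
            # Empty kwargs: an `if :` line would be a syntax error.
            lines.append(f"    return {grid!r}")
    return "\n".join(lines)


source = emit_grid_function("grid_fn", [({}, (4, 1, 1))])
namespace = {}
exec(source, namespace)  # compiles only because the empty case is handled
grid = namespace["grid_fn"]({})
```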

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141824
Approved by: https://github.com/Chillee
2024-12-02 04:17:21 +00:00
Xu Han
daed864f7b export AOTI_TORCH_EXPORT on Windows. (#140030)
Fixes #139954

reproduce UT:
```cmd
pytest test/inductor/test_torchinductor_codegen_dynamic_shapes.py -k test_device_assert_dynamic_shapes_cpu
```
Issue:
<img width="856" alt="image" src="https://github.com/user-attachments/assets/5fc501a9-54e5-45ac-9fb3-509ec11a7abe">

After fixing:
![Image](https://github.com/user-attachments/assets/883846fb-8e92-4b9c-9400-daab32382a3a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140030
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-12-02 03:20:29 +00:00
Yutao Xu
81ab2cc757 Update torch-xpu-ops commit pin (#141201)
Update the torch-xpu-ops commit to [1e32bbc](1e32bbc3d9), which includes:

- Improve XPU aten operator coverage
- Support basic `SparseXPU` operators

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141201
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-12-02 01:49:07 +00:00
chilli
795f28ac55 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-02 00:35:29 +00:00
chilli
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
cyyever
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
Jason Ansel
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
Roy Hvaara
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op generates incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous.
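
For readers unfamiliar with the two layouts, this pure-Python sketch computes the strides of an `(N, C, D, H, W)` tensor in the default contiguous format and in channels_last_3d. The helper names are hypothetical; the sketch only shows how the same shape maps to different memory orders, which is why a kernel that assumes one layout can produce wrong results on the other.

```python
def contiguous_strides(shape):
    """Row-major strides for a shape, innermost dimension last."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def channels_last_3d_strides(shape):
    """Strides for an (N, C, D, H, W) shape stored in N, D, H, W, C order."""
    n, c, d, h, w = shape
    return (d * h * w * c, 1, h * w * c, w * c, c)


shape = (2, 3, 4, 5, 6)
default_layout = contiguous_strides(shape)
channels_last = channels_last_3d_strides(shape)
```

In the channels_last_3d case the channel dimension has stride 1, so consecutive memory locations hold different channels of the same voxel.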

Added a regression test that verifies the output by running the same op on the CPU.

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
Blaine Burton Rister
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.
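
A small sketch of the representation change, using a hypothetical `tiling_as_dict` helper (not a function from this diff). Keying sizes by prefix lets callers look up a dimension by name instead of relying on tuple position.

```python
def tiling_as_dict(numels, prefixes=("x", "r")):
    """Pair each tiling size with its prefix, e.g. (8, 16) -> {"x": 8, "r": 16}."""
    if len(numels) != len(prefixes):
        raise ValueError("one prefix is needed per tiling dimension")
    return dict(zip(prefixes, numels))


numels = tiling_as_dict((8, 16))
reduction_numel = numels["r"]  # lookup by name rather than by index
```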

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
cyy
96be048f06 [1/N] Avoid copy in std::get (#141812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
Blaine Burton Rister
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
Blaine Burton Rister
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
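
A sketch of why a helper generalizes better than the equality check. The name `prefix_is_reduction` and its body are assumptions made for illustration; the actual helper in the PR may differ.

```python
def prefix_is_reduction(prefix: str) -> bool:
    """Treat any prefix that starts with "r" as a reduction dimension."""
    return prefix.startswith("r")


prefixes = ("x", "r", "r0_", "r1_")
old_style = [p == "r" for p in prefixes]                # misses r0_ and r1_
new_style = [prefix_is_reduction(p) for p in prefixes]  # covers all reduction prefixes
```

With the check centralized, supporting additional reduction prefixes requires changing one function rather than every call site.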

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
Fabian Keller
394c339691 improve typings in unflatten (#141817)
A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817
Approved by: https://github.com/Skylion007
2024-11-30 22:12:15 +00:00
FFFrog
8a81f7a4b6 Refactor functions in functorch for functional (#141808)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808
Approved by: https://github.com/Skylion007
2024-11-30 20:15:40 +00:00
atalman
0f3f801fc2 Add windows CUDA 12.6 nightly builds (#141805)
The Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-30 14:39:47 +00:00
eqy
9532589b53 [CUDA][64-bit indexing] Support 64-bit indexing in distribution_elementwise_grid_stride_kernel (#141613)
For #141544
Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements)
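
To show why 64-bit indexing matters for large tensors, the sketch below emulates signed 32-bit wraparound in Python. The helper names and the simple element-count criterion are assumptions for illustration; the real kernel dispatch logic may use different conditions.

```python
INT32_MAX = 2**31 - 1


def needs_64bit_indexing(numel: int) -> bool:
    """True when a linear index over `numel` elements cannot fit in int32."""
    return numel > INT32_MAX


def wrap_int32(value: int) -> int:
    """Interpret the low 32 bits of `value` as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


small = needs_64bit_indexing(2**10)      # small tensors keep 32-bit indexing
large = needs_64bit_indexing(2**31 + 1)  # large tensors need 64-bit indices
wrapped = wrap_int32(2**31)              # a 32-bit index would go negative here
```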

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2024-11-30 06:55:02 +00:00
Edward Z. Yang
7fafaa9c82 Introduce CompiledAOTI (#141695)
Stacked on https://github.com/pytorch/pytorch/pull/141691

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691
2024-11-30 00:05:41 +00:00
Bob Ren
2f72635a5c automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 22:36:53 +00:00
cyy
e29dabbd71 Fix performance-unnecessary-copy-initialization (#141792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792
Approved by: https://github.com/Skylion007
2024-11-29 22:10:06 +00:00