Commit graph

81673 commits

Author SHA1 Message Date
drisspg
42547f8d48 Add support for blackwell codegen (#141724)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141724
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/eqy
2024-12-03 20:34:43 +00:00
Mu-Chu Lee
8b0fcad0fd [AOTInductor] Add update_constant_buffer pybind support (#140755)
Summary: We add update_constant_buffer Python support for testing purposes.

Test Plan: Included in commit

Differential Revision: D65968613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140755
Approved by: https://github.com/22quinn
2024-12-03 20:34:25 +00:00
Ting Lu
e5f5283ab2 Fix cuda arch full version for 12.6 (#141976)
follow up for https://github.com/pytorch/pytorch/pull/141433/files
build still showing up as 12.6.2 in the name, see latest https://github.com/pytorch/pytorch/actions/runs/12134985224/job/33833276884.

related to https://github.com/pytorch/pytorch/issues/138440

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141976
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/Skylion007
2024-12-03 20:33:01 +00:00
Fabian Keller
f472b3aee1 improve typings around torch.export (#141829)
This is another follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `torch._export`. Even though the PR introduces a few runtime type asserts, the runtime behavior should stay equivalent, because the failed assertions should have been immediate crashes anyway.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141829
Approved by: https://github.com/ezyang
2024-12-03 19:57:21 +00:00
Bob Ren
43c5f59190 flip capture_autograd_function to default to true and warn if false (#141972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141972
Approved by: https://github.com/zou3519
ghstack dependencies: #141932
2024-12-03 19:50:14 +00:00
angelayi
96a35716d1 [aoti] Improve OSSProxyExecutor error messages (#141501)
For debugging issues like https://fb.workplace.com/groups/1028545332188949/permalink/1092584242451724/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141501
Approved by: https://github.com/henrylhtsang
2024-12-03 19:32:49 +00:00
Colin L. Rice
6b620423a3 dynamo_timed: Add a log_waitcounter option. (#141402)
This logs a waitcounter of the name pytorch.dynamo_timed.{key}.

Primarily sending this now to make sure everyone likes the API, then
I'll add tests, and migrate one dynamo_timed to use it. (likely starting
with
https://github.com/pytorch/pytorch/pull/141379).

Testing is a bit harder, since we don't normally have any way to read
_WaitCounter state AFAICT. I want to poke around and see if I can figure
out a way to read the state, otherwise I'll just mock it to at least
make sure it's mostly working.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141402
Approved by: https://github.com/jamesjwu, https://github.com/masnesral
2024-12-03 19:24:29 +00:00
drisspg
d35358b271 [FlexAttention] Remove failing num_warps=8 in bwds (#141653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141653
Approved by: https://github.com/BoyuanFeng
2024-12-03 19:22:52 +00:00
dan_the_3rd
9125e9119c Fix memory leak in ModuleTracker (#141960)
Thanks @drisspg and @albanD for finding the fix

**TEST PLAN**
```
import gc
import torch
import torch.nn as nn
from torch.utils.module_tracker import ModuleTracker

class MyModel(nn.Module):
    def forward(self, x):
        return x * x

print(f"torch=={torch.__version__}")
m = MyModel()
m.cuda()
m.to(torch.bfloat16)
mt = ModuleTracker()
for i in range(1000):
    if i % 100 == 0:
        gc.collect()
        print("memory_allocated:", torch.cuda.memory_allocated())
    x = torch.randn([128, 256], device="cuda", dtype=torch.bfloat16, requires_grad=True)
    with mt:
        m(x)

```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141960
Approved by: https://github.com/albanD
2024-12-03 18:36:15 +00:00
iupaikov-amd
7bb2228ffd Test cpp_wrapper_hipify string comparison (#141353)
Updating the test to match this code that takes device warpsize into account: cf1d95a965/torch/_inductor/codegen/cuda/device_op_overrides.py (L120)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141353
Approved by: https://github.com/desertfire
2024-12-03 18:25:32 +00:00
Chien-Chin Huang
8b5c26287d Initialize lr as a tensor if it is originally a tensor (#141620)
Fix https://github.com/pytorch/pytorch/issues/139575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141620
Approved by: https://github.com/kwen2501
2024-12-03 18:10:23 +00:00
Uttam Thakore
314e08eb52 [fr_trace][bugfix] Log missing ranks when provided (#141924)
Summary: For missing ranks issues, `build_collectives` doesn't log any errors (5c2584a14c/tools/flight_recorder/components/builder.py (L293C23-L306C24)), which means that when `EntryState.to_collective` is called [here](5c2584a14c/tools/flight_recorder/components/builder.py (L400C21-L405C22)), errors will be empty and `to_collective` will enter the first if statement. But that codepath doesn't log `missing_ranks`, meaning it will be absent from the `Collective` returned. This diff fixes that oversight.

Test Plan:
eyes

Sandcastle run

Differential Revision: D66679224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141924
Approved by: https://github.com/c-p-i-o
2024-12-03 17:54:43 +00:00
Andrew Gu
5c59f4a55a Remove old FSDP1 fully_shard (#141875)
FSDP1's `fully_shard` frontend was an exploration at the end of 2022 H2 as part of the `torch/distributed/_composable` APIs to avoid `nn.Module` wrappers. It calls into the same backend code as FSDP1's `FullyShardedDataParallel`.

The API did not gain traction internally, so we instead reused the name `fully_shard` for FSDP2, which similarly is not an `nn.Module` wrapper and follows similar design principles as FSDP1's `fully_shard`.

To the best of our knowledge, we have removed all instances of FSDP1's `fully_shard` internally, and we put the deprecation warning in open source in 2.4 saying it will be removed after 2.5 (which is now):
4959784dac/torch/distributed/_composable/fully_shard.py (L40-L48)

We are skipping the PR sanity check because this PR is only removing code, not adding new code, and should not require this sanity check.

Differential Revision: [D66664988](https://our.internmc.facebook.com/intern/diff/D66664988)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141875
Approved by: https://github.com/weifengpy
2024-12-03 17:00:47 +00:00
rzou
ed4831b93c Improve torch.library.opcheck and register_autograd docs (#141883)
Fixes https://github.com/pytorch/pytorch/issues/141618
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141883
Approved by: https://github.com/albanD
ghstack dependencies: #141894, #141880
2024-12-03 16:28:56 +00:00
rzou
827c322290 Make torch.library.triton_op public (#141880)
We've been using it privately for half a year and everything's been
good. This PR:
1. Makes torch.library.triton_op public
2. Renames capture_triton -> wrap_triton. We got feedback that no one
   knew what "capture triton" does.
3. Makes torch.library.wrap_triton public.

triton_op is used to construct a Python custom operator that may call 1+
triton kernels. Each of those triton kernels must be annotated with
wrap_triton.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141880
Approved by: https://github.com/albanD
ghstack dependencies: #141894
2024-12-03 16:28:56 +00:00
rzou
ac600fdce6 Type exposed_in decorator (#141894)
Test Plan:
- lintrunner
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141894
Approved by: https://github.com/albanD
2024-12-03 16:28:17 +00:00
Nikita Shulga
7a806a839d [FP8] Expand MaskedSelect to float8 (#141928)
Needed for printing those.
Though I wonder if a better solution would be to change those ops to use element size rather than actual type (to extend them automatically to unsigned integral types as well)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141928
Approved by: https://github.com/ezyang, https://github.com/jgong5
2024-12-03 15:14:26 +00:00
Xuehai Pan
78543e6002 [dynamo][pytree][1/N] make CXX pytree traceable: tree_iter / tree_leaves (#137397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137397
Approved by: https://github.com/jansel
2024-12-03 11:17:39 +00:00
Valentine233
9990b47ea3 [inductor][pattern matcher] revise mkldnn pattern matcher UT (#141334)
Fixes #139970, #139812.

Revise mkldnn pattern matcher UTs, to check the relevant specific matched patterns instead of the total matched number.
1) Add the missing specific counters in pattern matchers, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
2) In UTs, change the general `matcher_count`/`matcher_nodes` checks to the specific ones, e.g. `mkldnn_unary_fusion_matcher_nodes`/`mkldnn_conv_weight_pack_matcher_count`.
3) In UTs, remove the option of `matcher_count`/`matcher_nodes` params in _test_common and make `matcher_check_fn` a necessary param.
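
The motivation for items 1) to 3) can be illustrated with a small pure-Python sketch. The `counters` table and the `record_match` helper below are hypothetical stand-ins, not Inductor's real data structures; only the two specific counter names are taken from the list above.

```python
from collections import Counter

# Hypothetical stand-in for a counters table; not Inductor's actual object.
counters = Counter()


def record_match(pattern_name, nodes):
    """Bump both the global totals and the per-pattern counters for one match."""
    counters["matcher_count"] += 1
    counters["matcher_nodes"] += nodes
    counters[f"{pattern_name}_matcher_count"] += 1
    counters[f"{pattern_name}_matcher_nodes"] += nodes


# The pattern under test fires once and covers two nodes ...
record_match("mkldnn_unary_fusion", nodes=2)
# ... and an unrelated pattern fires as well, which shifts the global totals.
record_match("mkldnn_conv_weight_pack", nodes=1)


def matcher_check_fn():
    # Checks on specific counters are unaffected by unrelated matches.
    assert counters["mkldnn_unary_fusion_matcher_nodes"] == 2
    assert counters["mkldnn_conv_weight_pack_matcher_count"] == 1


matcher_check_fn()
```

A test asserting on the total `matcher_count` would have to change whenever an unrelated pattern starts or stops matching, whereas the specific checks keep passing.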

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141334
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-12-03 09:26:43 +00:00
Ryan Guo
ff73e2e679 [dynamo] Validate mutation_type and source in VariableTracker.__init__ (#141717)
As title, this also uncovered a few invalid use cases; the cases that
cause error are fixed in separate patches prior to this patch, and the
rest are fixed in this patch.

This patch also moves a few `.source` mutations to variable construction,
to increase the coverage of the validation.

Fixes #133027.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141717
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902, #141716
2024-12-03 09:18:06 +00:00
Ryan Guo
0efd184685 [dynamo] Fix side effects for range iterator that escapes the graph (#141716)
`wrap_range_iterator` mistakenly used `ValueMutationNew`, when it
should've used `ValueMutationExisting`, because this code path always
has a source.
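
For context, a minimal plain-Python sketch of the eager behaviour that has to be preserved when a pre-existing range iterator is advanced inside a function. The function name `take_one` is made up for illustration, and no Dynamo APIs are involved.

```python
def take_one(it):
    # Advancing the iterator mutates an object that the caller also holds.
    return next(it)


it = iter(range(5))
first = take_one(it)

# The caller observes the mutation: the first element is already consumed.
remaining = list(it)
```

Because the iterator exists before the function runs, its advanced state is a side effect on an existing object, not on one created inside the function.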

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141716
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715, #141902
2024-12-03 09:18:06 +00:00
Ryan Guo
7c3c8a662e [dynamo] Add RANGE_ITERATOR_MATCH to properly guard on range iterators (#141902)
A subsequent patch attempts to fix a side-effect issue for range
iterators, which in turn exposed an existing issue on guards for range
iterators -- the following test started failing:
```
PYTORCH_TEST_WITH_DYNAMO=1 python test/test_tensor_creation_ops.py TestTensorCreationCPU.test_hstack_column_stack_cpu_int16
```

This patch adds a `RANGE_ITERATOR_MATCH` guard to make sure that we
properly guard on range iterators, and adds a regression test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141902
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714, #141715
2024-12-03 09:18:06 +00:00
Ryan Guo
ff3f4a164c [dynamo] Fix aliasing issue for dict.copy that escapes the graph (#141715)
Dynamo accidentally passed the original `ConstDictVariable.source` to
the result of `dict.copy(...)`, which caused an aliasing issue when the
result escapes the graph (e.g., is a return value).

This patch fixes that and adds a regression test.
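
A minimal plain-Python sketch of the eager semantics at stake: the result of `dict.copy()` is a distinct object, so returning it must not alias the input. The function name `add_key` is made up for illustration.

```python
def add_key(d):
    c = d.copy()    # shallow copy: a new dict object
    c["extra"] = 1  # must not be visible through `d`
    return c


original = {"a": 0}
result = add_key(original)

# The returned dict is a different object, and the input is unchanged.
is_alias = result is original
```

If the copy were treated as having the same source as the original, the returned value could end up being the input dict itself, which is the aliasing problem described above.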

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141715
Approved by: https://github.com/jansel
ghstack dependencies: #141713, #141714
2024-12-03 09:18:06 +00:00
Ryan Guo
9eb0520d75 [dynamo] Fix side-effect handling for pre-existing collections.deque (#141714)
Previously we never replayed side effects to `DequeVariable` with a
source; the bug was already in the `test_deque_input` test, but went
unnoticed because we didn't check the deque objects.

This patch adds limited but practical support for this (see comments in
`side_effects.py` for why limited), and updates the deque tests to check
for this.
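
A minimal plain-Python sketch of the behaviour the updated tests check, namely that mutations to a deque passed in as an input remain visible to the caller afterwards. The function name `push` is made up for illustration.

```python
from collections import deque


def push(d, x):
    # Both calls mutate the caller's deque in place.
    d.append(x)
    d.appendleft(-x)
    return len(d)


q = deque([1, 2])
n = push(q, 3)
```

After the call, `q` itself holds the new elements, so a test that only inspects the return value would miss a failure to replay these side effects.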

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141714
Approved by: https://github.com/jansel
ghstack dependencies: #141713
2024-12-03 09:18:06 +00:00
Ryan Guo
f2ce2d435b [dynamo] Add test for returning a nested recursive function and update documentation (#141713)
Addresses https://github.com/pytorch/pytorch/pull/137905#discussion_r1806923085.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141713
Approved by: https://github.com/jansel
2024-12-03 09:18:06 +00:00
Mwiza Kunda
f8a64c324e Broadcast constants on vectorised stores in CppTile2DKernel (#140262)
Currently constants are not broadcasted on vectorised stores in `CppTile2DKernel`. This leads to errors like the following:
```shell
error:: request for member 'store' in 'tmp1', which is of non-class type 'signed char'
   61 |                                 tmp1.store(tmp2 + static_cast<int64_t>(8L*x0_inner), static_cast<int64_t>(8));
      |                                           ^~~~~
```
This PR adds the required broadcasting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140262
Approved by: https://github.com/jgong5
2024-12-03 09:15:17 +00:00
Bob Ren
e1e3bbc2e1 Set capture_autograd_function=False by default (#141932)
https://github.com/pytorch/pytorch/pull/136959 cleaned up the flag and added a warning. @Chillee pointed out that we should really default this flag to false otherwise we subject all users that go down this code path to log spew.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141932
Approved by: https://github.com/jansel
2024-12-03 07:59:03 +00:00
Nikita Shulga
e499b46465 Speed up half tensors printing (#141927)
This PR removes the copy-cast of reduced-precision types to float before printing. The cast was added in https://github.com/pytorch/pytorch/pull/14418, probably to unblock printing at a time when many operations, like `is_nan` and `max`, were not supported on CPUs.

(Reusing old test plan) Before the PR:
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
621 μs ± 5.06 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

after the PR
```python
In [1]: import torch; a = torch.rand(1, 1700, 34, 50, dtype=torch.float16)

In [2]: %timeit str(a)
449 μs ± 2.34 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
```

Also, this allows printing 15Gb Metal tensors on a 32GB Mac machine:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='mps:0', dtype=torch.float16)
```

Before this change, it failed with a non-descriptive error:
```
% python3 -c "import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
    import torch;print(torch.empty(72250,72250, device='mps', dtype=torch.float16))
                 ~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor.py", line 568, in __repr__
    return torch._tensor_str._str(self, tensor_contents=tensor_contents)
           ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 708, in _str
    return _str_intern(self, tensor_contents=tensor_contents)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 625, in _str_intern
    tensor_str = _tensor_str(self, indent)
  File "/Users/malfet/git/pytorch/pytorch/torch/_tensor_str.py", line 339, in _tensor_str
    self = self.float()
RuntimeError: Invalid buffer size: 19.45 GB
```
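
As a rough sanity check on the traceback above, the arithmetic below compares the storage needed for the tensor itself with the storage needed for a float32 copy of it. It assumes the reported size uses 2**30-byte units, which is a guess based on the number in the error message.

```python
# Storage needed for a 72250 x 72250 tensor in float16 versus float32.
rows = cols = 72250
numel = rows * cols

GIB = 2**30  # assumption: the error message reports sizes in 2**30-byte units

fp16_gib = numel * 2 / GIB  # 2 bytes per float16 element
fp32_gib = numel * 4 / GIB  # 4 bytes per float32 element

summary = f"float16: {fp16_gib:.2f} GiB, float32: {fp32_gib:.2f} GiB"
print(summary)
```

Under that assumption the float32 figure comes out at 19.45, matching the buffer size in the `RuntimeError`, which is consistent with the `self.float()` copy being the allocation that failed.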

Convert fp8 dtypes to float16, as the full float range is overkill
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141927
Approved by: https://github.com/ezyang
2024-12-03 07:01:49 +00:00
Xiaozhu Meng
d035db3d86 [AMD] [submodule] aten.bmm CK-backend prototype (#140758)
Summary:
Early prototype of adding CK backend for aten.bmm. Currently, it is very limited in that:

1. BF16 only
2. A single CK instance
3. NT layout only
4. Alpha=1, Beta=0 only

Reviewed By: xw285cornell, zjing14

Differential Revision: D65954695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140758
Approved by: https://github.com/bradleyhd
2024-12-03 06:54:51 +00:00
Edward Z. Yang
6afcec0c58 Assert is GraphModule in compile_fx_aot (#141575)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141575
Approved by: https://github.com/Skylion007, https://github.com/desertfire
2024-12-03 05:39:44 +00:00
PyTorch MergeBot
ce86119503 Revert "Set remote cache version and backend type once in compilation metrics (#141707)"
This reverts commit d633cf1f55.

Reverted https://github.com/pytorch/pytorch/pull/141707 on behalf of https://github.com/malfet due to It breaks tests by referencing FbRemoteFxGraphCache, but CI was green ([comment](https://github.com/pytorch/pytorch/pull/141707#issuecomment-2513555185))
2024-12-03 05:01:02 +00:00
PyTorch MergeBot
2999dbfd21 Revert "[REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)"
This reverts commit 3ab4a28eaa.

Reverted https://github.com/pytorch/pytorch/pull/141877 on behalf of https://github.com/huydhn due to Job are failing en masse after this lands, so it looks like a land race ([comment](https://github.com/pytorch/pytorch/pull/141877#issuecomment-2513552752))
2024-12-03 04:57:58 +00:00
Nikita Shulga
38bbe37187 Enable CI on SM89 (#140305)
Using EC2 G6 instance, based on NVIDIA L4, added to scale config in https://github.com/pytorch/test-infra/pull/5376

To enable more balanced sharding, had to push 148ae19935

Added `@xfailIfSM89` to the following tests:
 - test_fp8_pattern_2
 - test_original_aten_preserved_split_addmm
 - test_sparse_semi_structured_scaled_mm
 - test_sparse_semi_structured_scaled_mm_fp8
 - test_sparse_fp8fp8_mm

Increased tolerance to 2e-4 for `RNNTest.BidirectionalMultilayerGRU_CPU_vs_CUDA`

Skipped the following inductor tests (which either OOM flakily or time out):
 - test_reduction_fn_std_float64
 - test_reduction_fn_var_mean_float64
 - test_multi_output_unbacked_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140305
Approved by: https://github.com/wdvr, https://github.com/ZainRizvi
2024-12-03 04:49:46 +00:00
chilli
af88326250 Ensure that BlockMask length must always exactly match the sequence length in flex_attention (#141625)
Fixes https://github.com/pytorch/pytorch/issues/141435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141625
Approved by: https://github.com/drisspg
ghstack dependencies: #138788
2024-12-03 04:45:05 +00:00
Yidi Wu
9cfc9e636d [while_loop] change to guard_equals for checking output and carry (#141734)
Inputs with the same shape can be represented with different symbols, e.g.
```python
def body_fn(a, b):
    return b.sin(), a.sin()
```
where `a = torch.randn(3, 4)` and `b = torch.randn(3, 4)`. There could be 4 symbols allocated for `a` and `b`. So instead of requiring that the symbols in their shapes and strides be identical, we just use guard_equals to enforce the constraint.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141734
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-12-03 04:00:21 +00:00
Thomas Bohnstingl
871b93bc59 [associative_scan] Fixing shape checks (#141698)
This PR fixes the shape checks that are done in the associative_scan operation.
Previously, all input leaves were required to have the same shape. With this PR, only the shapes of the combine_fn outputs and the input leaves need to match; the input leaves no longer need to match one another.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141698
Approved by: https://github.com/ydwu4
2024-12-03 03:49:11 +00:00
Edward Z. Yang
3ab4a28eaa [REFACTOR] Inline FxGraphCache.post_compile into sole call site (#141877)
I am going to break apart the arguments passed to the constituents
to only pass exactly what is needed, so easy access to the insides
is helpful here.

This also moves two helper functions to output_code.py as well.

Also set _boxed_call at constructor.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141877
Approved by: https://github.com/jamesjwu, https://github.com/jansel

Co-authored-by: James Wu <jjwu@meta.com>
2024-12-03 03:48:23 +00:00
Mikayla Gawarecki
ecbb8a8800 Mention version of flip in weights_only error message (#141304)
Fixes https://github.com/pytorch/pytorch/issues/141139

How the 3 versions of the error message now look

### Version 1

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message:

```
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options, do those steps only if you trust the source of the checkpoint.
        (1) In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL __main__._rebuild_class_that_uses_build_instruction was not an allowed global by default. Please use `torch.serialization.add_safe_globals([_rebuild_class_that_uses_build_instruction])` or the `torch.serialization.safe_globals([_rebuild_class_that_uses_build_instruction])` context manager to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
### Version 2

Old error message:

```
_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
```

New error message:
```

_pickle.UnpicklingError: Weights only load failed. ``torch.nested`` and ``torch._dynamo`` must be imported to load nested jagged tensors (NJTs)
 In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.

```

### Version 3

Old error message
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

New error message
```
_pickle.UnpicklingError: Weights only load failed. In PyTorch 2.6, we changed the default value of the `weights_only` argument in `torch.load` from `False` to `True`. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
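
For readers unfamiliar with the "Unsupported global" wording in these messages, here is a minimal standard-library sketch of the general allowlisting idea, following the "restricting globals" pattern from the Python `pickle` documentation. It is an analogy only and not how `torch.load` or its `WeightsUnpickler` is implemented; `AllowlistUnpickler` and `restricted_loads` are hypothetical names.

```python
import io
import pickle
from collections import OrderedDict


class AllowlistUnpickler(pickle.Unpickler):
    """Hypothetical unpickler that only resolves explicitly allowed globals."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = allowed

    def find_class(self, module, name):
        # Called whenever the pickle stream references a global (class/function).
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")


def restricted_loads(data, allowed=frozenset()):
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


payload = pickle.dumps(OrderedDict(a=1))

# With an empty allowlist the load is refused ...
try:
    restricted_loads(payload)
    refused = False
except pickle.UnpicklingError:
    refused = True

# ... and it succeeds once the caller opts in to the specific global.
loaded = restricted_loads(payload, {("collections", "OrderedDict")})
```

The design choice mirrored here is that the caller must explicitly allow each global it trusts, instead of the loader resolving arbitrary names from the file.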

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141304
Approved by: https://github.com/zou3519
2024-12-03 03:26:27 +00:00
Michal Gallus
4cbb3b4bd2 [ROCm] Enable finding HIP and ROCm libraries on Windows (#137279)
This PR introduces support for finding HIP-SDK Libraries on Windows.

Since reading the code changes using the diff view is a bit cumbersome due to introduced if branch, let me explain what was changed:
- The Linux-specific steps to find HIP packages have been moved into the `if(UNIX)` block
- Windows steps follow in the `else()` clause

The separation was needed, because of several factors:
- HIP SDK for Windows typically names its components using `hip` in their names (for example: `hip_version.h` instead of `rocm_version.h`, `HIP_VERSION_DEV_MAJOR` instead of `ROCM_VERSION_DEV_MAJOR`, etc.),
- The libraries included in HIP SDK are only a subset of what is available in Linux ROCm (missing hsa-rt, rccl, roctx)
- MIOpen isn't a part of HIP SDK, but can be built separately and as of now requires an additional path to be defined using an env var.
- Windows can only find the hip package in a version greater than 1.0, and its libraries, if the lowercase `find_package(hip ...)` is invoked first. This is because the lowercase `hip` name will cause the mechanism to find hip's packages using [config mode](https://cmake.org/cmake/help/latest/command/find_package.html#search-modes), which is the only one supported on Windows, assuming we also want to [include its libraries](https://rocm.docs.amd.com/en/latest/conceptual/cmake-packages.html#consuming-the-hip-api-in-c-code). The upper-case, module-mode-searched `find_package(HIP)` is used later for inclusion of macros such as `hip_add_library` and related macros.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137279
Approved by: https://github.com/jeffdaily
2024-12-03 03:26:01 +00:00
eellison
33573488d0 Make Dtypepropagation singleton (#141882)
Should fix compile time regression, it was doing fairly expensive meta programming in init and being instantiated multiple times.
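
The general idea can be sketched in plain Python as follows. `ExpensiveHandler` and `get_handler` are hypothetical names and do not reflect the actual Inductor code; the sketch only shows how caching a single instance avoids repeating costly work in `__init__`.

```python
import functools


class ExpensiveHandler:
    """Hypothetical stand-in for a class that does costly setup in __init__."""

    init_calls = 0

    def __init__(self):
        type(self).init_calls += 1
        # Imagine expensive metaprogramming here, e.g. building a dispatch table.
        self.table = {name: name.upper() for name in ("add", "mul", "sin")}


@functools.lru_cache(maxsize=None)
def get_handler():
    # The first call constructs the instance; later calls return the cached one.
    return ExpensiveHandler()


a = get_handler()
b = get_handler()
```

Both calls return the same object, so the setup cost is paid once no matter how many call sites ask for the handler.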

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141882
Approved by: https://github.com/ezyang
ghstack dependencies: #139945, #140057, #141495
2024-12-03 03:15:16 +00:00
Benjamin Glass
f911361de1 Correctly specify size of sparse_csr tensors in maskedtensor binary ops (#134335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134335
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-12-03 02:55:57 +00:00
Aaron Gokaslan
08db735629 [BE]: Update mypy to 1.13.0 (#140808)
Update mypy to 1.13.0. Should hopefully reduce linting time. Has support for orjson cache serialization which should improve mypy cache perf if orjson is installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140808
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-12-03 02:50:10 +00:00
Guilherme Leobas
34127fc688 Only reconstruct dict if needed (#141606)
Fixes #141452

This is a follow-up of PR #134876, which optimized dict reconstruct to codegen only if any value changed. In this PR we cover the general case and do not codegen any instruction if the dictionary remains the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141606
Approved by: https://github.com/zou3519
2024-12-03 02:22:34 +00:00
Tugsbayasgalan (Tugsuu) Manlaibaatar
a6bea3d86d Fix DCe in training IR to reflect correct record function op (#141899)
Summary: The exit function is actually exit._recordFunction not exit.default

Test Plan: CI

Differential Revision: D66665359

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141899
Approved by: https://github.com/ydwu4
2024-12-03 01:59:37 +00:00
James Wu
d633cf1f55 Set remote cache version and backend type once in compilation metrics (#141707)
This is causing FbFxGraphRemoteCache.init to no longer be idempotent, i.e. only safe to call once per compile. AOTAutogradCache initializes a new remote cache for the forward and the backward.
Technically, we could make AOTAutogradCache smart and globally thread through a single FbFxGraphRemoteCache everywhere. But there's no reason to do so, as this class is just the handle to access the cache. Plus, it's very brittle for FbFxGraphRemoteCache to not be safe to call multiple times.

(Same problem, different fix of D66502138)

Differential Revision: [D66508492](https://our.internmc.facebook.com/intern/diff/D66508492/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141707
Approved by: https://github.com/ezyang
2024-12-03 01:49:11 +00:00
Yu, Guangye
77748ed8ec fix c10::Event UT failure on XPU backend (#141800)
# Motivation
Fix the UT failure introduced by https://github.com/pytorch/pytorch/pull/140865. An unrelated failure had been masking it.
It started to surface once https://github.com/pytorch/pytorch/pull/141546 landed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141800
Approved by: https://github.com/EikanWang
2024-12-03 01:34:42 +00:00
PyTorch MergeBot
09ce760fef Revert "Add missing data types at torch export serialization (#138561)"
This reverts commit 1ef1b3b391.

Reverted https://github.com/pytorch/pytorch/pull/138561 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/138561#issuecomment-2513343401))
2024-12-03 01:32:50 +00:00
Benjamin Glass
4959784dac Add API query for available per-process CUDA memory (#140620)
Certain `cpp_wrapper`-enabled tests were OOM-ing in the CI pipeline, with error messages suggesting that sufficient memory was accessible.  This ultimately resulted from an internal memory limitation that was not queryable in the API.  This PR adds querying for that limit.

Additionally, the failing tests had incorrect memory availability checks, and are updated with measured memory requirements.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140620
Approved by: https://github.com/malfet, https://github.com/eqy
ghstack dependencies: #141367
2024-12-03 00:24:03 +00:00
Chris Sidebottom
5c33c9202f Skip test_cpu_repo.py::CPUReproTests::test_auto_zvec_vsx_simd on AArch64 (#141155)
The skipping logic clearly states it shouldn't be running on this architecture. The test then fails due to `VecNEON` returning `128` from `bit_width()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141155
Approved by: https://github.com/jgong5, https://github.com/desertfire, https://github.com/malfet
2024-12-03 00:19:06 +00:00
atalman
c17ba69ba5 [submodule] Revert "Adds support for accelerated sorting with x86-simd-sort (#127936) (#141901)
Looks like the original PR caused: https://github.com/pytorch/pytorch/issues/140590

Please see comment: https://github.com/pytorch/pytorch/issues/140590#issuecomment-2508704480

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141901
Approved by: https://github.com/andrewor14, https://github.com/malfet
2024-12-03 00:16:35 +00:00