Commit graph

80420 commits

PyTorch MergeBot
709752e0bb Revert "[AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)"
This reverts commit 293fbb42d2.

Reverted https://github.com/pytorch/pytorch/pull/139154 on behalf of https://github.com/desertfire due to cpu_aot_inductor_amp_freezing fails ([comment](https://github.com/pytorch/pytorch/pull/139154#issuecomment-2452983651))
2024-11-02 13:04:00 +00:00
Edward Z. Yang
f6be44c74e Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.
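Conceptually (this is not the actual PyTorch API; the class and file layout below are purely illustrative), the profile behaves like a small per-job cache of which dimensions were made dynamic:

```py
import json
import os

# Purely illustrative sketch: persist automatic-dynamic decisions keyed by
# an explicitly supplied job id, and replay them on the next run.
class AutomaticDynamicProfile:
    def __init__(self, job_id, cache_dir="/tmp/pgo"):
        # No job id means the cache stays disabled (never on by default).
        self.path = os.path.join(cache_dir, f"{job_id}.json") if job_id else None
        self.dynamic_dims = {}  # e.g. {"L['x']": [0]}
        if self.path and os.path.exists(self.path):
            with open(self.path) as f:
                self.dynamic_dims = json.load(f)

    def record(self, source, dim):
        # Called when automatic_dynamic decides a dimension is dynamic.
        self.dynamic_dims.setdefault(source, []).append(dim)

    def save(self):
        if self.path:
            os.makedirs(os.path.dirname(self.path), exist_ok=True)
            with open(self.path, "w") as f:
                json.dump(self.dynamic_dims, f)
```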

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-02 11:50:11 +00:00
Ke Wen
55038aa661 [PGNCCL] Make sure we do not use split for P2P comm creation (#139013)
Resolve comment https://github.com/pytorch/pytorch/pull/138527#issuecomment-2438613172

There was a split-vs-P2P bug:
When P2P comm creation invokes `getNCCLComm`, it may see a `split_from` option that was meant for the previous PG creation. The P2P comm creation may then use `ncclCommSplit` and hang, because not all ranks join that call. The bug went unnoticed until now because there is no CI test with the following recipe: eager init + new group + P2P in that new group.
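A hedged sketch of that missing recipe (to be launched with torchrun on 2+ GPUs; the actual test added in this PR may differ):

```py
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
torch.cuda.set_device(rank)

# Eager init: passing device_id creates the NCCL comm eagerly.
dist.init_process_group("nccl", device_id=torch.device(f"cuda:{rank}"))

# New group creation, which may carry a split_from option.
group = dist.new_group(ranks=list(range(world_size)))

# P2P in that new group; before this fix the P2P comm creation could pick up
# the stale split_from option and hang in ncclCommSplit.
t = torch.ones(1, device="cuda")
if rank == 0:
    dist.send(t, dst=1, group=group)
elif rank == 1:
    dist.recv(t, src=0, group=group)

dist.destroy_process_group()
```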

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139013
Approved by: https://github.com/shuqiangzhang
2024-11-02 07:47:55 +00:00
PyTorch MergeBot
2a3fe06ce0 Revert "[Partitioner] Enumerate partitions by iterating partition ids (#136598)"
This reverts commit 39ec5a20ea.

Reverted https://github.com/pytorch/pytorch/pull/136598 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails an executorch test https://github.com/pytorch/executorch/blob/main/exir/backend/test/test_graph_partition.py#L114-L175 ([comment](https://github.com/pytorch/pytorch/pull/136598#issuecomment-2452903705))
2024-11-02 07:19:22 +00:00
PyTorch MergeBot
f3238106fd Revert "Allow inplacing buffer when other users are inconsequential (#138383)"
This reverts commit 030f70b40b.

Reverted https://github.com/pytorch/pytorch/pull/138383 on behalf of https://github.com/huydhn due to Sorry for reverting this again, but I think it has a test failing internally and also on ROCm ([comment](https://github.com/pytorch/pytorch/pull/138383#issuecomment-2452898229))
2024-11-02 06:53:48 +00:00
PyTorch MergeBot
0863d6a08e Revert "[inductor] Remove SIMDKernel.last_usage (#139364)"
This reverts commit 286d3ce266.

Reverted https://github.com/pytorch/pytorch/pull/139364 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:11 +00:00
PyTorch MergeBot
9331640e26 Revert "[inductor] Remove Node.last_usage mutation (#139365)"
This reverts commit 1e934b473c.

Reverted https://github.com/pytorch/pytorch/pull/139365 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
dc4b459737 Revert "[inductor] Move remove_kernel_local_buffers to Kernel (#139370)"
This reverts commit b57b4b7f9b.

Reverted https://github.com/pytorch/pytorch/pull/139370 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
66a401c9e1 Revert "[inductor] Simplify remove_kernel_local_buffers (#139452)"
This reverts commit 73c0762a34.

Reverted https://github.com/pytorch/pytorch/pull/139452 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
PyTorch MergeBot
98e11b0021 Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)"
This reverts commit c53beab377.

Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337))
2024-11-02 06:49:10 +00:00
Bob Ren
fdd298dcb7 add hex method on SymFloat (#139451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139451
Approved by: https://github.com/ezyang
2024-11-02 05:33:19 +00:00
PyTorch MergeBot
8d1eaa3da6 Revert "Profile guided optimization for automatic_dynamic (#139001)"
This reverts commit a6630bcf87.

Reverted https://github.com/pytorch/pytorch/pull/139001 on behalf of https://github.com/ezyang due to internal code triggers import cycle ([comment](https://github.com/pytorch/pytorch/pull/139001#issuecomment-2452833882))
2024-11-02 03:38:15 +00:00
drisspg
540f3ef9b1 Fix flex_decode to build offsets off of strides (#139516)
Fixes issue https://github.com/pytorch/pytorch/issues/139462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139516
Approved by: https://github.com/Chillee
2024-11-02 03:17:46 +00:00
Bin Bao
293fbb42d2 [AOTI] Switch OSS dashboard to use aoti_compile_and_package (#139154)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139154
Approved by: https://github.com/angelayi
ghstack dependencies: #139153
2024-11-02 03:10:05 +00:00
Bin Bao
a46a79fe92 [AOTI] Ignore .o files in package_aoti (#139153)
Summary: There is no point in packaging .o files, since a .so file is already included in the package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139153
Approved by: https://github.com/angelayi
2024-11-02 03:10:05 +00:00
Jason Ansel
c53beab377 [inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) (#139523)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364, #139365, #139370, #139452
2024-11-02 03:04:22 +00:00
Justin Chu
387b120549 [ONNX] Remove type promotion rule for pow (#139527)
ONNX supports different input types in Pow, so type promotion is not needed.
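A minimal sketch of the export call behind this output, reconstructed from the test referenced in the graph below:

```py
import torch

class Model(torch.nn.Module):
    def forward(self, x):
        return x**2.0

# Dynamo-based export; with type promotion removed, the fp16 input feeds
# Pow directly alongside the float exponent constant.
onnx_program = torch.onnx.export(
    Model(), (torch.randn(3, dtype=torch.float16),), dynamo=True
)
print(onnx_program)
```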

The resulting graph is the following:

```py
ONNXProgram(
    model=
        <
            ir_version=9,
            opset_imports={'': 18, 'pkg.onnxscript.torch_lib.common': 1},
            producer_name='pytorch',
            producer_version='2.6.0a0+git59a1af5',
            domain=None,
            model_version=None,
        >
        graph(
            name=main_graph,
            inputs=(
                %"x"<FLOAT16,[3]>
            ),
            outputs=(
                %"pow_1"<FLOAT16,[3]>
            ),
        ) {
            0 |  # node_Constant_0
                 %"val_0"<?,?> ⬅️ ::Constant() {value=Tensor<FLOAT,[]>(array(2., dtype=float32), name=None)}
            1 |  # node_Pow_1
                 %"pow_1"<FLOAT16,[3]> ⬅️ ::Pow(%"x", %"val_0")
            return %"pow_1"<FLOAT16,[3]>
        }
...
    ,
    exported_program=
        ExportedProgram:
            class GraphModule(torch.nn.Module):
                def forward(self, x: "f16[3]"):
                     # File: /workspace/pytorch/test/onnx/exporter/test_small_models_e2e.py:53 in forward, code: return x**2.0
                    pow_1: "f16[3]" = torch.ops.aten.pow.Tensor_Scalar(x, 2.0);  x = None
                    return (pow_1,)

        Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x'), target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='pow_1'), target=None)])
        Range constraints: {}

)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139527
Approved by: https://github.com/titaiwangms
2024-11-02 02:19:50 +00:00
Matthew Sterrett
7e65060410 Adds support for accelerated sorting with x86-simd-sort (#127936)
Adds x86-simd-sort as a submodule to accelerate sorting for 32-bit and 64-bit datatypes when AVX2 or AVX512 are available.

For contiguous data, this can be over a 10x speedup for large arrays. For discontiguous data, it can give over a 4x speedup with larger arrays. These benchmarks were gathered on a Skylake system (7900x), limited to 8 threads.

<details>
<summary><b>Contiguous Benchmarks</b></summary>

```
float32, normally distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.150844336    6.886271477    7.132277489    1.038420335    1.002603214
128            9.208030939    8.478154898    7.846915245    1.086089019    1.173458697
1024           37.79037627    23.60707456    16.44122627    1.600807257    2.298513241
10000          714.7355628    203.9921844    105.5683001    3.503739934    6.770361577
100000         8383.074408    721.6333354    465.3709247    11.61680593    18.01374766
1000000        97124.31945    5632.054572    3920.148401    17.24491803    24.77567416
10000000       1161974.907    86070.48988    71533.82301    13.50027063    16.24371323

int32_t, uniformly distributed (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             7.203208685    6.92212224     7.014458179    1.040606975    1.026908779
128            8.972388983    8.195516348    7.592543125    1.094792396    1.18173698
1024           32.77489477    23.6874548     15.36617105    1.383639359    2.132925285
10000          607.8824128    193.3402024    99.25090471    3.144107667    6.124703997
100000         523.9384684    608.1836536    442.3166784    0.861480682    1.184532472
1000000        5211.348627    5271.598405    3518.861883    0.988570871    1.480975611
10000000       133853.6263    81463.05084    67852.97394    1.643120714    1.972700952
```

</details>

Note that the int32_t sort is accelerated by FBGEMM's radix sort for larger arrays, but this only handles contiguous data and in one sorting direction.

<details>
<summary><b>Discontiguous Benchmarks</b></summary>

```
float, normal distributed, discontiguous in sorted dimension (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.836543679    4.011214256    3.84376061     0.956454439    0.99812243
128            5.755310194    5.755723127    4.820394962    0.999928257    1.193949923
1024           49.46946019    24.78790785    15.47874362    1.995709379    3.195960952
10000          665.2505291    236.6165959    143.9490662    2.811512551    4.621429974
100000         4328.002203    1329.001212    818.3516414    3.256582586    5.288682743
1000000        47651.5018     16693.72045    11827.39551    2.854456677    4.028909133
10000000       556655.1288    236252.6258    184215.9828    2.356185998    3.021752621

int32_t, uniformly distributed, discontiguous in sorted dimension  (in microseconds)
size           Default        AVX2           AVX512         Default/AVX2   Default/AVX512
16             3.817994356    3.878117442    3.770039797    0.984496837    1.012719908
128            5.578731397    5.577152082    4.716770534    1.000283176    1.182743862
1024           43.3412619     23.61275801    14.55446819    1.835501887    2.977866408
10000          634.3997478    224.4322851    133.9518324    2.826686667    4.736028889
100000         4084.358152    1292.363303    781.7867576    3.16037924     5.22438902
1000000        46262.20465    16608.35284    11367.51817    2.785478192    4.06968381
10000000       541231.9104    235185.1861    180249.9294    2.301301028    3.002674742
```

</details>
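For a rough sanity check of these numbers from Python (assuming a build with AVX2/AVX512 and this submodule enabled):

```py
import timeit
import torch

# Contiguous float32 sort; with an AVX2/AVX512 build this should hit the
# x86-simd-sort path benchmarked above.
x = torch.randn(1_000_000)
print("contiguous:", timeit.timeit(lambda: torch.sort(x), number=10), "s / 10 runs")

# Discontiguous case: sort along the strided dimension of a 2D tensor.
y = torch.randn(1000, 1000)
print("discontiguous:", timeit.timeit(lambda: torch.sort(y, dim=0), number=10), "s / 10 runs")
```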

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127936
Approved by: https://github.com/jgong5, https://github.com/peterbell10, https://github.com/sanchitintel
2024-11-02 02:14:01 +00:00
Chen, Zejun
edd3f5a94d [profiler] fix a building warning by adding USE_KINETO namespace for setTraceID (#139461)
Fix: https://github.com/pytorch/pytorch/issues/139460
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139461
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/sraikund16
2024-11-02 01:02:29 +00:00
Angela Yi
092fe2f422 Handle nan case when checking mutations (#139483)
Test Plan: PT2 readiness models

Differential Revision: D65340986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139483
Approved by: https://github.com/zou3519
2024-11-02 00:49:05 +00:00
William Wen
b71e813bce [dynamo, 3.13] fix bytecode nop tests (#139323)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139323
Approved by: https://github.com/jansel
2024-11-02 00:39:36 +00:00
Bin Bao
8c17830dea [AOTI] Unify how weights are stored as data section (#139471)
Summary: https://github.com/pytorch/pytorch/pull/118076 introduced a cleaner way to link weights as a data section for macOS. Unify the code by adopting that approach for Linux as well.

Test Plan: CI

Differential Revision: D65302273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139471
Approved by: https://github.com/chenyang78
2024-11-02 00:23:24 +00:00
PyTorch UpdateBot
aa54b2467f [executorch hash update] update the pinned executorch hash (#139133)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139133
Approved by: https://github.com/pytorchbot
2024-11-02 00:14:47 +00:00
eellison
ee2f8a50d3 Class rename (#139490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139490
Approved by: https://github.com/exclamaforte, https://github.com/zou3519
ghstack dependencies: #139295
2024-11-02 00:10:17 +00:00
PyTorch MergeBot
c95adb9c5b Revert "use more elements per thread for narrow dtypes (#139449)"
This reverts commit f5b9e725d1.

Reverted https://github.com/pytorch/pytorch/pull/139449 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but a bunch of tests are failing after it lands, it looks like a landrace ([comment](https://github.com/pytorch/pytorch/pull/139449#issuecomment-2452723863))
2024-11-01 23:42:16 +00:00
PyTorch MergeBot
b617d4813c Revert "fix dynamo tracking numpy 2 ops (#138686)"
This reverts commit 124eac255e.

Reverted https://github.com/pytorch/pytorch/pull/138686 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I am seeing inductor failure with hf_BigBird number of graph breaks after it lands ([comment](https://github.com/pytorch/pytorch/pull/138686#issuecomment-2452718164))
2024-11-01 23:34:06 +00:00
Nikita Shulga
77b72d686e [BE][MPS] Make metal shaders compile cleanly (#139522)
I.e. without warnings, by deleting dead code and fixing one signed-unsigned comparison warning.

Also, pass `-Werror` to the Metal compiler if the WERROR option is set.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139522
Approved by: https://github.com/Skylion007
2024-11-01 23:22:47 +00:00
eellison
2382b3b6d8 [Easy] Add joint graph passes, fallback_random to bisector (#139295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139295
Approved by: https://github.com/zou3519, https://github.com/exclamaforte
2024-11-01 23:21:53 +00:00
Gabriel Ferns
1e73842029 Refactor FxGraphDrawer to use HTML-like labels (#137726)
Fixes https://github.com/pytorch/pytorch/issues/137499
Testing: Added a new unit test to make sure that the regression case succeeds.
Testing: Added a new unit test to make sure that the regression case succeeds.
I'm debating whether to make the borders visible. I'm partial to no borders, but that might make the diagrams harder for some people to read.
![68a2b0e3-orig_fx_graph_diagram](https://github.com/user-attachments/assets/fbc2fd98-9e76-488e-8ebe-c64fbf206932)
Vs.
![2bfe1c4f-orig_fx_graph_diagram](https://github.com/user-attachments/assets/b6bc88ba-dda2-4cf7-84ac-a615e1e03a74)
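For reference, a small pydot sketch of what a Graphviz HTML-like label looks like (the node name and fields below are made up, not taken from FxGraphDrawer):

```py
import pydot

# HTML-like labels render node contents as a small table instead of a
# plain record label; this is the style the drawer switches to.
label = (
    '<<TABLE BORDER="0" CELLBORDER="1" CELLSPACING="0">'
    "<TR><TD>name</TD><TD>add_1</TD></TR>"
    "<TR><TD>target</TD><TD>aten.add.Tensor</TD></TR>"
    "</TABLE>>"
)
graph = pydot.Dot("fx", graph_type="digraph")
graph.add_node(pydot.Node("add_1", label=label, shape="none"))
print(graph.to_string())
```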

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137726
Approved by: https://github.com/eellison, https://github.com/malfet
2024-11-01 23:19:50 +00:00
David Berard
60542eeb33 [inductor] set sanitize_overflow=False for triton kernels (#139502)
In upstream triton, https://github.com/triton-lang/triton/pull/4589 introduces overflow checks. However, overflow checks likely add some overhead, and have some correctness bugs at the moment (e.g. https://github.com/triton-lang/triton/pull/5033). Let's set `sanitize_overflow=False` but keep `debug=True` so that we can keep using device_assert but without the additional asserts added by `sanitize_overflow`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139502
Approved by: https://github.com/bertmaher
2024-11-01 23:10:21 +00:00
Huy Do
da395384a2 Delete Windows GPU jobs in periodic (#139336)
As an outcome of https://fburl.com/gdoc/voce5o06, we could stop running Windows GPU tests on periodic pending the green light from MS. No one is monitoring these jobs atm.

We already have Windows CUDA and CPU build jobs in trunk.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139336
Approved by: https://github.com/ZainRizvi, https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:26:22 +00:00
Shuqiang Zhang
4c64a7f33f [pgnccl] add a restart test for PGs in blocking mode (#139496)
Summary:
Restarting (aborting and re-initializing a PG) is a basic need if we want
to achieve in-process restart of PGs without tearing down the whole
process.

Add this test to verify that this is supported by current NCCL.
Note that this restart test passes reliably only in blocking mode for now.
In nonblocking mode, there is a problem in either NCCL init or abort that
needs further investigation.
Test Plan:
new UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139496
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-11-01 22:13:37 +00:00
Huy Do
0b13bdd877 Delete parallelnative jobs in periodic (#139328)
As an outcome of https://fburl.com/gdoc/voce5o06, we can now clean up the parallelnative build and test jobs in periodic. There is not much value in running them anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139328
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:05:13 +00:00
Huy Do
8eb75cbad6 Delete iOS jobs from periodic (#139345)
As an outcome of https://fburl.com/gdoc/voce5o06 and after confirming with @iseeyuan, we can now clean up the iOS lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/ios-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139345
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-11-01 22:04:27 +00:00
Huy Do
8ad76efb8d Delete Vulkan jobs from periodic (#139354)
As an outcome of https://fburl.com/gdoc/voce5o06, we can clean up this job now, as the backend has been marked as deprecated (https://pytorch.org/tutorials/prototype/vulkan_workflow.html) and is to be replaced by the ExecuTorch Vulkan delegate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139354
Approved by: https://github.com/wdvr, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-01 22:03:12 +00:00
Mikayla Gawarecki
a979318ef7 Add section to serialization note re weights_only (#139433)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139433
Approved by: https://github.com/malfet
ghstack dependencies: #138936, #139221
2024-11-01 21:51:50 +00:00
Nikita Shulga
a1f854f270 [MPS] Compile kernels into Metallib (#138636)
The PyTorch MPS backend for the most part relies on MPSGraph to provide specific operations, but more and more often one has had to implement custom kernels that were simply embedded in the operator codebase and compiled directly using [`- id<MTLLibrary>newLibraryWithSource:options:error:`](https://developer.apple.com/documentation/metal/mtldevice/1433431-newlibrarywithsource) (the first Metal kernel in the MPS backend was added in https://github.com/pytorch/pytorch/pull/82307 ).
Later on, as the number of operators grew, those were refactored into the `MetalShaderLibrary` convenience class (see https://github.com/pytorch/pytorch/pull/125550 ).

But as the number of kernels keeps growing, it's time to take the next step and properly compile them into `.metallib`.

This PR does exactly that by:
 - Moving shader sources into separate .metal files
 - Adds a check on whether full Xcode is installed or just DeveloperTools
 - If full Xcode is installed, compiles and links shaders into a .metallib for the Metal-3.0 standard (available on macOS 13) and the Metal-3.1 standard (available on macOS 14, can use bfloat) and bundles both using the `-sectcreate` linker option and the `getsectiondata` API call. The `metallib_dummy.cpp` file is used to properly express dependencies between the metallib build and torch_cpu link stages. Logic for generating metallibraries is loosely based on https://github.com/ml-explore/mlx/blob/main/mlx/backend/metal/kernels/CMakeLists.txt.
 - If only the DeveloperTools CLI is installed, automatically wraps the .metal source into a `_metallib.h` header that contains the shader source wrapped in `MetalShaderLibrary`

The bulk of changes introduced in this PR is just moving code around. I.e. for every file that contains a non-templated shader definition in the `aten/src/ATen/native/mps/operators` folder, a corresponding `.metal` file is created in the `aten/src/ATen/native/mps/kernels` folder and the embedded shader definition is replaced with the following
```cpp
#ifndef PYTORCH_JIT_COMPILE_SHADERS
static auto& lib = MetalShaderLibrary::getBundledLibrary();
#else
#include <ATen/native/mps/OpName_metallib.h>
#endif
```

Some historical stats:
| PyTorch Version  | Number of shaders in MPS | Ops added |
| ------------- | ------------- | ---- |
| 1.12  | 0  | |
| 1.13  | 2  | bitwise_ops and  index.out |
| 2.0  | 4  | cross, repeat, and view  |
| 2.1  | 9   | unary_ops, histogram, renorm, binary_ops |
| 2.2  | 11   | gamma and bucketization |
| 2.3  | 12  | naive_matmul (to work around a crash) |
| 2.4 | 13 | quantized_mm |
| 2.5 | 14 | fused_adam |

Pros:
  - Better code structure/readability
  - Eventually allows one to use shared headers (and implement something like `TensorIterator`)
  - Faster runtime (as compilation is done ahead of time) and perhaps better optimized compiled kernels

Cons:
  - The build process is a bit more complicated than it used to be
  - Need to maintain two codepaths (as our CI builders only have DeveloperTools installed)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138636
Approved by: https://github.com/manuelcandales
2024-11-01 21:47:20 +00:00
Edward Z. Yang
a6630bcf87 Profile guided optimization for automatic_dynamic (#139001)
Previously: https://github.com/pytorch/pytorch/pull/138052 but the implementation is done from scratch, so I open a new PR.

This implements the ability to save and load profiles of automatic dynamic decisions, so on subsequent runs we can directly make something automatically dynamic. Unlike the previous implementation, this cache is never enabled by default; instead, you have to specify a "job id" that says it's OK to share results. We will be able to automatically populate this id for internal MAST jobs but for generic OSS users you will have to explicitly opt into it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Differential Revision: [D65065497](https://our.internmc.facebook.com/intern/diff/D65065497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139001
Approved by: https://github.com/oulgen
2024-11-01 21:43:25 +00:00
Xuan Zhang
9c2ffce71a add condition for freeable input buffer (#139480)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139480
Approved by: https://github.com/yf225
ghstack dependencies: #139396
2024-11-01 21:15:40 +00:00
Huy Do
18f3b3c991 Clean up Android jobs in CI (#139350)
As an outcome of https://fburl.com/gdoc/voce5o06 and after confirming with @iseeyuan, we can now clean up the Android lite interpreter jobs on PyTorch CI. There is not much value in running them anymore.

It's stated in https://github.com/pytorch/android-demo-app/blob/master/README.md that ExecuTorch is the replacement now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139350
Approved by: https://github.com/ZainRizvi
2024-11-01 21:10:19 +00:00
Sam Larsen
c412a42ae2 [pt2 logging] move remote cache get/put logging up one level (#139423)
Summary: I need to refactor the way we record CompilationMetrics, and having the relevant timing code in the OSS area of the codebase will make that much easier. I doubt this meaningfully changes the values we see.

Test Plan: Made sure samples show up: https://fburl.com/scuba/dynamo_compile/sandbox/c38zjq0x

Differential Revision: D65290089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139423
Approved by: https://github.com/oulgen
2024-11-01 21:06:59 +00:00
Animesh Jain
0e57f2b589 [invoke_subgraph] Change the joint_graph output signature to simplify min-cut partitioner (#139326)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139326
Approved by: https://github.com/zou3519
ghstack dependencies: #139216, #139130
2024-11-01 21:02:32 +00:00
Animesh Jain
6a268c3fbb [invoke_subgraph] Generate fake_inputs correctly (#139130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139130
Approved by: https://github.com/zou3519
ghstack dependencies: #139216
2024-11-01 21:02:32 +00:00
Animesh Jain
4c756cacfd [invoke_subgraph] Re-enable fake tensor model in the fake tensor impl (#139216)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139216
Approved by: https://github.com/zou3519
2024-11-01 21:02:32 +00:00
Justin Chu
5d67efb809 [ONNX] New registration API (#135403)
The ONNX custom ops registration API.

## Design

1. Create a `custom_translation_table: dict[Callable, Sequence[Callable] | Callable]` parameter for specifying extra functions
2. Use a callable as the key to support all possible call_function targets in the fx graph
3. Allow a callable or a Sequence of callables as values.
   - When there is a single callable, it is the translation function for the op
   - When there is a Sequence of callables, the exporter's dispatcher will dispatch to them in order based on input dtypes.
   - The translation functions can be plain Python functions that call onnxscript ops (traced), or onnxscript functions.
   - Complex input support: We create special type annotations for annotating real representations of complex inputs, which are needed to handle complex computation in the ONNX graph, as we don't have any ops in ONNX that handle complex inputs. The dispatcher will have knowledge of these newly created type annotations and dispatch correctly. The complex functions will be in the same overload pool as the real functions.

```py
torch.onnx.export(
    dynamo=True,
    custom_translation_table={
        torch.ops.aten.add: [overload1, overload2],
        torch.sym_not: sym_not_onnx,
    },
)
```
Support for functions that handle complex inputs will be in separate PRs.
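For illustration, a sketch of what one value in `custom_translation_table` could look like; it assumes onnxscript's `opset18`, and the `sym_not_onnx` name is just the hypothetical one from the example above:

```py
from onnxscript import opset18 as op

# A plain Python function that calls ONNX Script ops; the exporter traces
# it when dispatching torch.sym_not.
def sym_not_onnx(x):
    return op.Not(x)
```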

fixes https://github.com/pytorch/pytorch/issues/138391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135403
Approved by: https://github.com/titaiwangms
2024-11-01 20:58:54 +00:00
Natalia Gimelshein
f5b9e725d1 use more elements per thread for narrow dtypes (#139449)
Fix a perf issue for narrow types by accessing more elements per thread.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139449
Approved by: https://github.com/Chillee, https://github.com/eqy
2024-11-01 20:41:13 +00:00
Jason Ansel
73c0762a34 [inductor] Simplify remove_kernel_local_buffers (#139452)
I plan to reuse `can_buffer_be_removed_through_fusion` in some heuristics.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139452
Approved by: https://github.com/shunting314
ghstack dependencies: #139364, #139365, #139370
2024-11-01 20:36:39 +00:00
Bert Maher
dcdcb8b364 Avoid overflow in float32-to-int32 test (#139489)
Summary:

Triton has added some integer overflow detection when kernels are compiled with
`debug=True`, and this test results in integer overflow (float32 2.0 has bit pattern 0x40000000; times 2 is 0x80000000, which overflows a signed int32).

Assertion `int32 overflow detected for operation mul` failed
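The arithmetic, spelled out in plain Python just to illustrate the bit pattern:

```py
import struct

# float32 2.0 reinterpreted as int32 is 0x40000000; doubling it exceeds
# INT32_MAX (0x7fffffff), which is what the sanitizer flags.
bits = struct.unpack("<i", struct.pack("<f", 2.0))[0]
print(hex(bits))       # 0x40000000
print(hex(bits * 2))   # 0x80000000
```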

Fixes #139479

Test Plan:
```
python inductor/test_torchinductor.py -k test_float32_to_int32_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/139489
Approved by: https://github.com/eellison, https://github.com/jansel, https://github.com/chenyang78
2024-11-01 20:22:19 +00:00
Yifu Wang
0dbc284a72 [SymmetricMemory] expose signal_pads as tensors in Python (#138754)
## This Stack

This stack does the following things to support `xformers`-style, comm-aware Triton kernels:
- Exposes `signal_pad`s as tensors in Python
- Adds a binding for `cuMemsetAsync`

These in combination aims to provide users with more flexibility to express custom signaling/synchronization patterns.

## This PR

```python
# Obtain the signal pad of the specified peer rank as a tensor.
# If both shape and dtype are unspecified, the returned tensor will be a
# 1d uint32 tensor, which is most natural for signaling purposes.
symm_mem.get_signal_pad(peer_rank)

# If only shape is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank)[:shape.numel()].view(shape)
symm_mem.get_signal_pad(peer_rank, shape)

# If only dtype is specified, it is equivalent to:
# symm_mem.get_signal_pad(peer_rank).view(dtype)
symm_mem.get_signal_pad(peer_rank, dtype=dtype)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138754
Approved by: https://github.com/weifengpy, https://github.com/lw
2024-11-01 20:17:15 +00:00
Haifeng Jin
124eac255e fix dynamo tracking numpy 2 ops (#138686)
Fixes #136559
With the upgrade to NumPy 2, torch incorrectly filtered out `numpy.random` ops as unsupported in Dynamo tracing.
This PR changes the filtering rules to include them while keeping behavior with numpy 1 unchanged.

Before this PR, the following tests failed:

```
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_functions.py -k FunctionTests.test_numpy_random
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/dynamo/test_unspec.py -k UnspecTests.test_to_tensor
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k FakeTensorTest.test_export_numpy
PYTORCH_TEST_WITH_ASAN=1 PYTORCH_TEST_WITH_UBSAN=1 python test/test_fake_tensor.py -k PropagateRealTensorsFakeTensorTest.test_export_numpy_propagate_real_tensors
```

With this PR, the supported/unsupported ops in NumPy 1 are not changed.
For NumPy 2, only the `numpy.random` ops that are already supported with NumPy 1 are added to the supported list.

I used the following scripts to check the differences before and after the change for both NumPy 1 & 2.
The output is empty for NumPy 1 since there is no change.
The output is a list of `numpy.random` that considered supported for NumPy 2.

```py
from torch._dynamo import trace_rules
import numpy as np

def new_numpy_function_ids():
    unsupported_funcs = {"seed", "ranf", "get_bit_generator", "RandomState", "set_bit_generator", "sample"}

    def is_supported(k, v, mod):
        if not callable(v):
            return False
        if not getattr(v, "__module__", None):
            return True
        if v.__module__ == mod.__name__:
            return True
        if v.__module__ == "numpy.random.mtrand" and mod.__name__== "numpy.random" and k not in unsupported_funcs:
            return True
        return False
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        for k, v in mod.__dict__.items():
            if is_supported(k, v, mod):
                rv[id(v)] = f"{mod.__name__}.{k}"
    return rv

def old_numpy_function_ids():
    rv = {}
    for mod in trace_rules.NP_SUPPORTED_MODULES:
        rv.update(
            {
                id(v): f"{mod.__name__}.{k}"
                for k, v in mod.__dict__.items()
                if callable(v)
                and (getattr(v, "__module__", None) or mod.__name__) == mod.__name__
            }
        )
    return rv

rv1 = set(old_numpy_function_ids().values())
rv2 = set(new_numpy_function_ids().values())

for v in (rv1 - rv2):
    print(v)
print("****")
for v in (rv2 - rv1):
    print(v)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138686
Approved by: https://github.com/lezcano, https://github.com/williamwen42
2024-11-01 19:51:40 +00:00