Commit graph

81589 commits

Author SHA1 Message Date
chilli
8eb259fdc3 Added option to control number of kernel options displayed (#138788)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788
Approved by: https://github.com/drisspg
2024-12-02 00:35:29 +00:00
cyyever
fc74ec4989 [2/N] Avoid copy in std::get (#141826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826
Approved by: https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-12-02 00:16:48 +00:00
Jason Ansel
b2fe1b9409 [inductor] Fix 3d tiling (#141709)
Fixes #141121

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709
Approved by: https://github.com/eellison
2024-12-01 19:47:41 +00:00
Roy Hvaara
90f19fee8a [MPS] Convert channels_last_3d to contiguous for input tensor in nn.Conv3d (#141780)
When the input tensor to Conv3d is in the channels_last_3d memory format, the Conv3d op will generate incorrect output (see the example image in #141471). This PR checks if the op is 3d and, if so, attempts to convert the input tensor to contiguous.

Added a regression test that verifies the output by running the same op on the CPU.
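A minimal sketch of the regression-test idea (illustrative only, requires an MPS device; the actual test lives in the MPS test suite):

```
import torch

conv = torch.nn.Conv3d(3, 5, kernel_size=3)
x = torch.randn(1, 3, 8, 8, 8).to(memory_format=torch.channels_last_3d)

ref = conv(x)  # CPU reference
out = conv.to("mps")(x.to("mps"))
torch.testing.assert_close(out.cpu(), ref, rtol=1e-3, atol=1e-3)
```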

I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context?

Fixes #141471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780
Approved by: https://github.com/malfet
2024-12-01 18:36:53 +00:00
Blaine Burton Rister
5deca07c0d [Inductor] Represent tiling as a dict (#141751)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions.

This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`.
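For illustration, a minimal sketch of the shape change (values made up):

```
# Before: a positional tuple whose meaning depends on order.
numels_old = (8, 16)  # (xnumel, rnumel)

# After: a dict keyed by tree prefix.
numels_new = {"x": 8, "r": 16}

# The dict form generalizes naturally to multi-dimensional reductions,
# e.g. with prefixes like "r0_" and "r1_":
numels_multi = {"x": 8, "r0_": 4, "r1_": 4}
```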

Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first.

# Test plan
The existing CI provides good coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751
Approved by: https://github.com/jansel
2024-12-01 09:54:34 +00:00
cyy
96be048f06 [1/N] Avoid copy in std::get (#141812)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
Blaine Burton Rister
c2fa544472 [Inductor] move block pointer analysis to a new module (#141733)
# Summary

Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing.

# Test plan

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733
Approved by: https://github.com/jansel
2024-11-30 23:21:24 +00:00
Blaine Burton Rister
49fde426ba [Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738)
Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`.
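A minimal sketch of the helper's idea (the function name here is an assumption, not necessarily the actual inductor API):

```
def prefix_is_reduction(prefix: str) -> bool:
    # Any prefix starting with "r" marks a reduction dimension, so
    # multi-dim prefixes like "r0_", "r1_" pass the same check.
    return prefix.startswith("r")

assert prefix_is_reduction("r")       # classic 1-d reduction
assert prefix_is_reduction("r0_")     # multi-dim reduction prefix
assert not prefix_is_reduction("x")   # pointwise dimension
```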

Tested by the existing CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738
Approved by: https://github.com/jansel
2024-11-30 22:38:13 +00:00
Fabian Keller
394c339691 improve typings in unflatten (#141817)
A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230).

This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`.

CC @Skylion007 @ezyang

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817
Approved by: https://github.com/Skylion007
2024-11-30 22:12:15 +00:00
FFFrog
8a81f7a4b6 Refactor functions in functorch for functional (#141808)
As the title states.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808
Approved by: https://github.com/Skylion007
2024-11-30 20:15:40 +00:00
atalman
0f3f801fc2 Add windows CUDA 12.6 nightly builds (#141805)
The Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-11-30 14:39:47 +00:00
eqy
9532589b53 [CUDA][64-bit indexing] Support 64-bit indexing in distribution_elementwise_grid_stride_kernel (#141613)
For #141544
Overhead doesn't seem to be noticeable, even at small sizes (e.g., 2**10 elements).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613
Approved by: https://github.com/Skylion007, https://github.com/ngimel
2024-11-30 06:55:02 +00:00
Edward Z. Yang
7fafaa9c82 Introduce CompiledAOTI (#141695)
Stacked on https://github.com/pytorch/pytorch/pull/141691

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691
2024-11-30 00:05:41 +00:00
Bob Ren
2f72635a5c automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 22:36:53 +00:00
cyy
e29dabbd71 Fix performance-unnecessary-copy-initialization (#141792)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792
Approved by: https://github.com/Skylion007
2024-11-29 22:10:06 +00:00
chuanqiw
a23ac6f8bd [CD] Enable pypi dependencies both for XPU linux and Windows whls (#141135)
Enable XPU runtime PyPI packages as dependencies of XPU CD wheels for both Linux and Windows.
Fixes https://github.com/pytorch/pytorch/issues/135867
Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141135
Approved by: https://github.com/atalman
2024-11-29 21:35:07 +00:00
George Wigley
44707b0667 Pass rounding_mode for div reference inputs through kwargs (#136308)
Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op.
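For context, a minimal example of the keyword argument in question:

```
import torch

a = torch.tensor([7.0, -7.0])
b = torch.tensor([2.0, 2.0])

torch.div(a, b)                         # tensor([ 3.5000, -3.5000])
torch.div(a, b, rounding_mode="floor")  # tensor([ 3., -4.])
torch.div(a, b, rounding_mode="trunc")  # tensor([ 3., -3.])
```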

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308
Approved by: https://github.com/albanD

Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com>
Co-authored-by: Bob Ren <bobren@meta.com>
Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com>
Co-authored-by: siahuat0727 <tansiahuat@gmail.com>
2024-11-29 21:28:24 +00:00
Ke Wen
ed092e2161 [2/N] Rename NCCLTraceBuffer to FlightRecorder (#141712)
Just a name change. No behavior change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141712
Approved by: https://github.com/wconstab, https://github.com/fduwjj
ghstack dependencies: #141648
2024-11-29 21:15:31 +00:00
Zhengxu Chen
a8a570512b [export] Generate compatible thrift schema out of schema.py (#141611)
Summary: To make sure schema.py and schema.thrift are kept in sync, we use the int keys from thrift and Python's `Annotated` type to associate fields between thrift and schema.py. Later we will use this association to build a single source of truth between the schemas.

Test Plan: CI

Differential Revision: D66253157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141611
Approved by: https://github.com/yiming0416
2024-11-29 20:09:49 +00:00
cyyever
7dd9b5fc43 Fix NOLINTNEXTLINE (#141794)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-11-29 16:23:59 +00:00
PyTorch MergeBot
9e98b3d73c Revert "automatic dynamic unspecialize float (#141647)"
This reverts commit 1a32daeb17.

Reverted https://github.com/pytorch/pytorch/pull/141647 on behalf of https://github.com/atalman due to functorch/test_aotdispatch.py::TestAOTAutogradWithCache::test_inner_grad [GH job link](https://github.com/pytorch/pytorch/actions/runs/12080983316/job/33697901875) [HUD commit link](1a32daeb17) ([comment](https://github.com/pytorch/pytorch/pull/141647#issuecomment-2507980876))
2024-11-29 15:00:33 +00:00
siahuat0727
3c63e76b03 [PT2E Quantization] Fix RecursionError when prepare_pt2e graph with concat of the same node (#141651)
Fixes #129038

Related PR #129567

Here is the new PR against main, thanks! @jerryzh168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141651
Approved by: https://github.com/jerryzh168
2024-11-29 09:19:22 +00:00
Xilun Wu
ce572fedfc [dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow (#141532)
**Summary**
DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call.
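A minimal illustration (assumes a PyTorch build with `torch.uint64` support):

```
import torch

seed = 2**64 - 1
# torch.tensor([seed])                        # raises: cannot fit in int64
t = torch.tensor([seed], dtype=torch.uint64)  # OK with an explicit dtype
```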

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220, #141223
2024-11-29 07:59:34 +00:00
Xilun Wu
93cbb287c2 [dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223)
**Summary**
This PR proposes 4 changes to DTensor RNG management:
1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`.
2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set a different initial seed for each pipeline stage by calling
```
world_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(2, 2, 2),
    mesh_dim_names=("pp", "dp", "tp"),
)
pp_mesh = world_mesh["pp"]
pp_rank = pp_mesh.get_local_rank()
spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd")  # this flattening is only needed if you need to call collective over this mesh
torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh)
```

In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they are responsible for passing in the right value, and DTensor won't perform any checks on it. If the current rank is not part of the mesh, it will use the current device RNG state to initialize.

3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen.

4. Enforce that the `manual_seed` call only accepts a "full mesh", i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and that the SPMD ranks have their RNG states synchronized.

**Motivation**
tl;dr

1. Lazily initializing the DTensor RNG tracker causes a hang in non-SPMD code such as Pipeline Parallel.
2. Users may want to set different seeds on ranks in one device mesh.
3. We want to keep the old behavior if users prefer not to curate the RNG state and want to have DTensor take care of it.

See details in https://github.com/pytorch/pytorch/issues/140301

**Test**
`pytest test/distributed/_tensor/test_random_ops.py`
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223
Approved by: https://github.com/wconstab
ghstack dependencies: #141731, #141220
2024-11-29 07:59:34 +00:00
Xilun Wu
7f5bc9dd87 [dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP (#141220)
**Summary**
The ad-hoc DTensor RNG tracker was used to mimic Megatron's DDP+TP RNG behavior, but it turns out to be incompatible with PyTorch Distributed FSDP2+TP, so we decided to deprecate it and replace it with `OffsetBasedRNGTracker`, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results).

**Motivation**
`TensorParallelRNGTracker` was designed for DDP+TP, where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However, this does not apply to the new FSDP+TP composable combination, where the model weights are sharded along the data parallel mesh dimension as well. Therefore we decided to remove this outdated RNG tracker type for now. If users need an exact match between PyTorch Distributed and Megatron in random number generation results, feel free to file an issue.

**Impact**
`TensorParallelRNGTracker` was only used when Tensor Parallel was in use (i.e. when calling `parallelize_module`).

For non-FSDP users, the "replicas get the same random numbers and shards get different ones" behavior remains unchanged. Unlike `TensorParallelRNGTracker`, which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (the default value is 1234, but users can call `torch.distributed.tensor._random.manual_seed` to modify it) on all ranks, and chooses the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant.

For FSDP2 users, an improvement should be observed: a DTensor sharded within the DP group now gets different random number sampling, which `TensorParallelRNGTracker` failed to provide, though we're not sure how much this change will improve the eventual training loss convergence.

**Test**
1-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init`

2-d model weight meta init:
`pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init`

TP model weight init test:
`pytest test/distributed/tensor/parallel/test_tp_random_state.py`

FSDP+TP model weight init test:
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220
Approved by: https://github.com/wconstab
ghstack dependencies: #141731
2024-11-29 07:59:26 +00:00
Xilun Wu
c55191f3a2 [dtensor][random] add 1d and 2d model meta init tests (#141731)
**Summary**
Added tests for model meta init on a 1-d mesh (TP) and a 2-d mesh (FSDP+TP). These tests exercise the issue where DTensor RNG failed to initialize weights differently across FSDP ranks.

**Test**
`pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731
Approved by: https://github.com/wconstab
2024-11-29 07:59:20 +00:00
Bob Ren
1a32daeb17 automatic dynamic unspecialize float (#141647)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647
Approved by: https://github.com/ezyang
2024-11-29 07:53:53 +00:00
Xia, Weiwen
9827d677b4 [Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 (#141480)
Annotate the linear node for `linear_dynamic_fp16` with `X86InductorQuantizer`.
After `convert_pt2e`, the pattern will be
```
  x
  |
linear <- to_fp32 <- to_fp16 <- w
```

**Test plan**
```
pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-11-29 07:48:39 +00:00
Yang Wang
b7a45dbae3 Add monitor script (#141438)
# Overview
Add monitor script to collect system-level utilization data during CI tests.
Currently all monitoring scripts are disabled.

# Details
- Add flag to customize the time intervals for logging
- Enable multiple GPU utilization logging

# Next step
Enable the monitor script in non-perf-test workflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438
Approved by: https://github.com/huydhn
2024-11-29 04:14:31 +00:00
Roy Hvaara
4d5c096a55 [MPS] Add autocast rule for SDPA (#141776)
Fixes #141774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-11-29 03:34:03 +00:00
Edward Z. Yang
b97a786125 Inline compile_to_fn at its only call site (#141691)
Stacked on https://github.com/pytorch/pytorch/pull/141689

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688, #141689
2024-11-29 01:15:38 +00:00
Edward Z. Yang
9e4723cc6e Unify post_compile1 and CompiledFxGraph constructor (#141689)
Stacked on https://github.com/pytorch/pytorch/pull/141688

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141689
Approved by: https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685, #141688
2024-11-29 01:15:38 +00:00
Edward Z. Yang
29326b9d29 Hoist post_compile1 into fx_codegen_and_compile (#141688)
Stacked on top of https://github.com/pytorch/pytorch/pull/141685

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141688
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #141681, #141683, #141685
2024-11-29 01:15:31 +00:00
Edward Z. Yang
cf3daf723f Unify cache disable and cache bypass paths (#141685)
I was constantly annoyed that we had a separate else branch for when the cache was disabled, distinct from when the cache was bypassed. This diff gets rid of the disabled-cache branch, so we use the same logic for bypass/disable. I think this change probably didn't matter much for the POC, but it's cleaner.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141685
Approved by: https://github.com/aorenste
ghstack dependencies: #141681, #141683
2024-11-29 01:15:24 +00:00
Aaron Gokaslan
7224cd4471 [BE]: Update 12.6 builds to CUDA 12.6.3 (#141433)
Update CUDA 12.6 builds to Update 3 and bump cusparse-lt to 0.6.3 (see #141365). Was going to leave some comments on #141365, but thought it was just faster to open a PR here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141433
Approved by: https://github.com/atalman
2024-11-28 22:01:47 +00:00
Richard Barnes
ae6519cb74 [codemod] c10::string_view -> std::string_view in fields (#141736)
Summary: `c10::string_view` is being removed, so we need to migrate.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D65830276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141736
Approved by: https://github.com/Skylion007
2024-11-28 21:35:53 +00:00
Ivan Zaitsev
09a3eddc07 Revert #141066 and #141494 (#141721)
manual revert due to merge conflicts

note: #141494 was reverted out of order, blocking automatic revert of #141066

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721
Approved by: https://github.com/avikchaudhuri
2024-11-28 20:18:19 +00:00
PyTorch MergeBot
d08bd6d627 Revert "Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)"
This reverts commit 8a3317cd41.

Reverted https://github.com/pytorch/pytorch/pull/141587 on behalf of https://github.com/atalman due to inductor/test_torchinductor_strided_blocks.py::TritonBlockPointerTestGPU::test_expand_broadcast_x_size0_y_size0_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/12072823884/job/33669367764) [HUD commit link](8a3317cd41) ([comment](https://github.com/pytorch/pytorch/pull/141587#issuecomment-2506690095))
2024-11-28 19:41:03 +00:00
Pruthvi Madugundu
907c31f529 [ROCm] devtoolset / GCC11 upgrade on manylinux images - 1b of 2 (docker images) (#141609)
Upgrade gcc version from 9 to 11 on ROCm manylinux images.

Needed for #141423, since the almalinux8-based manylinux2_28 images for ROCm (#140681) install gcc-toolset-9, which provides [gcc 9.2.1](https://pkgs.org/download/gcc-toolset-9-gcc-c++). However, PyTorch's CMakeLists.txt enforces a [minimum gcc version of 9.3](5318bf8baf/CMakeLists.txt (L61)).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141609
Approved by: https://github.com/jeffdaily

Co-authored-by: Jithun Nair <jithun.nair@amd.com>
2024-11-28 19:18:09 +00:00
Bludator
f4187050fe [ONNX] Remove special handling of torchvision.ops imports in onnx export (#141569)
Fixes #141568

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141569
Approved by: https://github.com/titaiwangms

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>
2024-11-28 18:05:40 +00:00
Edward Z. Yang
6d204cb5ed Hoist set_feature_use out of conditional, rename some variables (#141683)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141683
Approved by: https://github.com/jamesjwu, https://github.com/jansel
ghstack dependencies: #141681
2024-11-28 17:43:11 +00:00
Edward Z. Yang
229daf7470 Inline FxGraphCache.load into its sole call site (#141681)
I need to restructure the body of FxGraphCache.load with the outer if-else in its call site, so inline it goes!

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141681
Approved by: https://github.com/jamesjwu, https://github.com/jansel
2024-11-28 17:43:11 +00:00
chuanqiw
b9a8df4bdd [CD] Add triton xpu build back (#141775)
The Triton XPU build was stopped temporarily by https://github.com/pytorch/pytorch/pull/139206 to wait for the triton xpu upgrade PR https://github.com/pytorch/pytorch/pull/137886 to land.

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141775
Approved by: https://github.com/atalman
2024-11-28 17:37:42 +00:00
cyy
6b430c26bd Fix bugprone-argument-comment (#141777)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141777
Approved by: https://github.com/Skylion007
2024-11-28 16:56:50 +00:00
Mwiza Kunda
8a3317cd41 Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587)
This increases test coverage for triton CPU from just test_torchinductor.py to also cover block pointer lowering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587
Approved by: https://github.com/jansel
2024-11-28 16:45:25 +00:00
Nan Zhang
5aacfa037b [Inductor] fix broadcast logic for Triton (#141027) (#141693)
Summary:

Fix the logic for inserting a broadcast in kernels where a load goes directly to a store. In that case, we insert a `tl.broadcast` on the store regardless of the block size on the load. Where a broadcast is not required, the downstream Triton compiler is expected to remove the no-op broadcast instruction.
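Rough sketch of the resulting codegen shape in Triton (illustrative only; the actual generated kernel names and shapes differ):

```
import triton
import triton.language as tl

@triton.jit
def copy_kernel(in_ptr, out_ptr, xnumel, XBLOCK: tl.constexpr):
    # A load flowing directly to a store: the broadcast is emitted on the
    # store unconditionally; when it is a no-op, the Triton compiler is
    # expected to eliminate it.
    xindex = tl.program_id(0) * XBLOCK + tl.arange(0, XBLOCK)
    xmask = xindex < xnumel
    val = tl.load(in_ptr + xindex, mask=xmask)
    tl.store(out_ptr + xindex, tl.broadcast_to(val, [XBLOCK]), mask=xmask)
```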

Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases.

Reviewed By: blaine-rister

Differential Revision: D65518033

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693
Approved by: https://github.com/blaine-rister
2024-11-28 16:38:25 +00:00
Laith Sakka
f684dbd002 Try to simplify FloorDiv axioms implications when needed during evaluations. (#141267)
Summary:
This is very much the same solution proposed by bobrenjc93, except that it restricts the simplification to expressions and axioms that contain FloorDiv, since those are the only ones that could have become CleanDiv, and the only ones that can change as the shape env changes.

This also does not break the torchrec benchmarks. It might be worth finding out why the generalized version does break them, but we could just be hitting another bug or NYI situation.
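For a sense of the expressions involved, a small sketch (assuming `FloorDiv` from `torch.utils._sympy.functions`; this is not the shape-env code itself):

```
import sympy
from torch.utils._sympy.functions import FloorDiv

s0 = sympy.Symbol("s0", integer=True, positive=True)

# An axiom like FloorDiv(s0, 2) == 5 holds for both s0 = 10 and s0 = 11;
# once the shape env learns that s0 is divisible by 2, the FloorDiv could
# have become CleanDiv, so axioms containing FloorDiv are re-simplified.
axiom = sympy.Eq(FloorDiv(s0, 2), 5)
print(axiom.subs(s0, 10))  # True
print(axiom.subs(s0, 11))  # True
print(axiom.subs(s0, 12))  # False
```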

Overhead?
None on
```
buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000
```

Differential Revision: D66307433

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141267
Approved by: https://github.com/ezyang
2024-11-28 15:35:35 +00:00
chuanqiw
d49f0bf466 [CI] Fix xpu linux ci build environment duplicated issue (#141546)
We found duplicated build environments in the XPU Linux CI tests, which meant test jobs could download the wrong PyTorch build artifact file. See https://github.com/pytorch/pytorch/actions/runs/12023238798/job/33518351906#step:14:633

Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141546
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-11-28 14:21:21 +00:00
atalman
0f261e8f77 Add Manylinux2014 and Manylinux 2.28 config to triton builds. Run auditwheel on triton binaries (#141704)
This PR combines Manylinux 2_28 and Manylinux 2014 builds of triton under one workflow. This is required in order to support torch CPU, CUDA 11.8, and CUDA 12.4 wheels built with Manylinux 2014, and torch CUDA 12.6 wheels built with Manylinux 2_28.

Manylinux 2014 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl``
Manylinux 2_28 wheels:
``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl``

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn
2024-11-28 13:40:39 +00:00
eellison
f83361b274 inductor dtype propagation fixes (#141495)
- Add in upcast_compute_type on creation of new tensors (loads, constants)
- Fixes index_expr - right now we are somewhat inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but not doing so in this PR.
- Bug fix in view dtype where we were always upcasting back to fp32 when the input was in bf16/fp16. We should only be doing that if the output is also in bf16/fp16.
- For masked, avoid calling dtype propagation and just use the output dtype.

Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it to test turning off codegen_upcast_to_fp32.

Follow ups:

- We could consider requiring fewer explicit upcast_compute_type calls and doing it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done that in this PR.
- Be more consistent in our index expr dtype printing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495
Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang
ghstack dependencies: #139945, #140057
2024-11-28 11:39:38 +00:00