pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
chilli	8eb259fdc3	Added option to control number of kernel options displayed (#138788 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138788 Approved by: https://github.com/drisspg	2024-12-02 00:35:29 +00:00
cyyever	fc74ec4989	[2/N] Avoid copy in std::get (#141826 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141826 Approved by: https://github.com/Skylion007, https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-12-02 00:16:48 +00:00
Jason Ansel	b2fe1b9409	[inductor] Fix 3d tiling (#141709 ) Fixes #141121 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141709 Approved by: https://github.com/eellison	2024-12-01 19:47:41 +00:00
Roy Hvaara	90f19fee8a	[MPS] Convert `channels_last_3d` to `contiguous` for input tensor in `nn.Conv3d` (#141780 ) When the input tensor to Conv3d is in the channels_last_3d memory format the Conv3d op will generate incorrect output (see example image in #141471). This PR checks if the op is 3d, and then attempts to convert the input tensor to contiguous. Added a regression test that verifies the output by running the same op on the CPU. I'm unsure if Conv3d supports the channels last memory format after #128393. If it does, we should consider updating the logic to utilize this as it would be more efficient. Perhaps @DenisVieriu97 knows or has more context? Fixes #141471 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141780 Approved by: https://github.com/malfet	2024-12-01 18:36:53 +00:00
Blaine Burton Rister	5deca07c0d	[Inductor] Represent tiling as a dict (#141751 ) # Summary Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This makes it easier to generalize to multi-dimensional reductions. This diff refactors `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`. Note: this is based off of https://github.com/pytorch/pytorch/pull/141738, which enables `tree.is_reduction`. That PR should land first. # Test plan The existing CI provides good coverage. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141751 Approved by: https://github.com/jansel	2024-12-01 09:54:34 +00:00
cyy	96be048f06	[1/N] Avoid copy in std::get (#141812 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812 Approved by: https://github.com/Skylion007	2024-12-01 03:53:35 +00:00
Blaine Burton Rister	c2fa544472	[Inductor] move block pointer analysis to a new module (#141733 ) # Summary Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. This refactors the ModularIndexing block pointer analysis into its own module. That way, we can call it from other places besides Triton codegen. In the parent PR, we will use this to find tiling splits that simplify the indexing. # Test plan Tested by the existing CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141733 Approved by: https://github.com/jansel	2024-11-30 23:21:24 +00:00
Blaine Burton Rister	49fde426ba	[Inductor] Use a helper function to tell if a tree or prefix is a reduction (#141738 ) Preparatory refactor for https://github.com/pytorch/pytorch/pull/137243. Previously, we would typically check for reductions by `tree.prefix == "r"`. This PR moves the check into a helper function. This makes it easier to generalize the code to multi-dimensional reductions, which could have multiple prefixes like `("r0_", "r1_")`. Tested by the existing CI. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141738 Approved by: https://github.com/jansel	2024-11-30 22:38:13 +00:00
Fabian Keller	394c339691	improve typings in unflatten (#141817 ) A first follow-up to https://github.com/pytorch/pytorch/pull/115074 / https://github.com/pytorch/pytorch/pull/141240 following the strategy discussed there (https://github.com/pytorch/pytorch/pull/115074#issuecomment-2480992230). This PR improves the type annotations around `unflatten.py` which had been inaccurate due to the previously suppressed type checking on `torch.nn.Module`. CC @Skylion007 @ezyang Pull Request resolved: https://github.com/pytorch/pytorch/pull/141817 Approved by: https://github.com/Skylion007	2024-11-30 22:12:15 +00:00
FFFrog	8a81f7a4b6	Refactor functions in functorch for functional (#141808 ) As the title stated Pull Request resolved: https://github.com/pytorch/pytorch/pull/141808 Approved by: https://github.com/Skylion007	2024-11-30 20:15:40 +00:00
atalman	0f3f801fc2	Add windows CUDA 12.6 nightly builds (#141805 ) Windows AMI was published to prod. This PR adds CUDA 12.6 nightly builds Pull Request resolved: https://github.com/pytorch/pytorch/pull/141805 Approved by: https://github.com/huydhn, https://github.com/Skylion007	2024-11-30 14:39:47 +00:00
eqy	9532589b53	[CUDA][64-bit indexing] Support 64-bit indexing in `distribution_elementwise_grid_stride_kernel` (#141613 ) For #141544 Overhead doesn't seem to be noticeable even on small sizes (e.g., 2**10 elements) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141613 Approved by: https://github.com/Skylion007, https://github.com/ngimel	2024-11-30 06:55:02 +00:00
Edward Z. Yang	7fafaa9c82	Introduce CompiledAOTI (#141695 ) Stacked on https://github.com/pytorch/pytorch/pull/141691 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141695 Approved by: https://github.com/aorenste ghstack dependencies: #141681, #141683, #141685, #141688, #141689, #141691	2024-11-30 00:05:41 +00:00
Bob Ren	2f72635a5c	automatic dynamic unspecialize float (#141647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647 Approved by: https://github.com/ezyang	2024-11-29 22:36:53 +00:00
cyy	e29dabbd71	Fix performance-unnecessary-copy-initialization (#141792 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141792 Approved by: https://github.com/Skylion007	2024-11-29 22:10:06 +00:00
chuanqiw	a23ac6f8bd	[CD] Enable pypi dependencies both for XPU linux and Windows whls (#141135 ) Enable xpu runtime pypi packages as dependencies of XPU CD wheels both for Linux and Windows. Fixes https://github.com/pytorch/pytorch/issues/135867 Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141135 Approved by: https://github.com/atalman	2024-11-29 21:35:07 +00:00
George Wigley	44707b0667	Pass rounding_mode for div reference inputs through kwargs (#136308 ) Previously, the reference inputs for div with rounding mode did not supply the rounding_mode keyword argument. This didn't match the sample inputs for this op. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136308 Approved by: https://github.com/albanD Co-authored-by: Xia, Weiwen <weiwen.xia@intel.com> Co-authored-by: Bob Ren <bobren@meta.com> Co-authored-by: Xilun Wu <12968408+XilunWu@users.noreply.github.com> Co-authored-by: siahuat0727 <tansiahuat@gmail.com>	2024-11-29 21:28:24 +00:00
Ke Wen	ed092e2161	[2/N] Rename NCCLTraceBuffer to FlightRecorder (#141712 ) Just name change. No behavior change. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141712 Approved by: https://github.com/wconstab, https://github.com/fduwjj ghstack dependencies: #141648	2024-11-29 21:15:31 +00:00
Zhengxu Chen	a8a570512b	[export] Generate compatible thrift schema out of schema.py (#141611 ) Summary: To make sure schema.py and schema.thrift are kept in sync, we use the int keys from thrift and use Python Annotated type to associate fields between thrift and schema.py. Later we will use this association to build a single source of truth between the schemas. Test Plan: CI Differential Revision: D66253157 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141611 Approved by: https://github.com/yiming0416	2024-11-29 20:09:49 +00:00
cyyever	7dd9b5fc43	Fix NOLINTNEXTLINE (#141794 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141794 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2024-11-29 16:23:59 +00:00
PyTorch MergeBot	9e98b3d73c	Revert "automatic dynamic unspecialize float (#141647 )" This reverts commit `1a32daeb17`. Reverted https://github.com/pytorch/pytorch/pull/141647 on behalf of https://github.com/atalman due to functorch/test_aotdispatch.py::TestAOTAutogradWithCache::test_inner_grad [GH job link](https://github.com/pytorch/pytorch/actions/runs/12080983316/job/33697901875) [HUD commit link](`1a32daeb17`) ([comment](https://github.com/pytorch/pytorch/pull/141647#issuecomment-2507980876))	2024-11-29 15:00:33 +00:00
siahuat0727	3c63e76b03	[PT2E Quantization] Fix RecursionError when prepare_pt2e graph with concat of the same node (#141651 ) Fixes #129038 Related PR #129567 Here is the new PR against main, thanks! @jerryzh168 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141651 Approved by: https://github.com/jerryzh168	2024-11-29 09:19:22 +00:00
Xilun Wu	ce572fedfc	[dtensor][random] use torch.uint64 as the seed/offset tensor dtype to avoid overflow (#141532 ) Summary DTensor RNG code raises an error if the seed passed in is beyond the `torch.int64` range (e.g. `torch.tensor([2**64-1])` raises an error). The solution is to specify `dtype=torch.uint64` in the `torch.tensor()` call. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141532 Approved by: https://github.com/wconstab ghstack dependencies: #141731, #141220, #141223	2024-11-29 07:59:34 +00:00
Xilun Wu	93cbb287c2	[dtensor][random] allow user to manual_seed different seed on device mesh; only sync RNG state in WORLD when manual_seed has not been called (#141223 ) Summary This PR proposes 4 changes to DTensor RNG management: 1. DTensor allows users to eagerly initialize the RNG tracker by calling `torch.distributed.tensor._random.manual_seed`. 2. DTensor `manual_seed` no longer checks the integrity of the `seed` argument. Users are responsible for setting the same seed on all ranks within an SPMD group, but if there are multiple separate SPMD groups (e.g. across pipeline stages), users should set a _different_ seed for each SPMD group. For cases like Pipeline Parallel, users can set different initial seeds for pipelining stages by calling ``` world_mesh = init_device_mesh( device_type="cuda", mesh_shape=(2, 2, 2), mesh_dim_names=("pp", "dp", "tp"), ) pp_mesh = world_mesh["pp"] pp_rank = pp_mesh.get_local_rank() spmd_mesh = world_mesh["dp", "tp"]._flatten("spmd") # this flattening is only needed if you need to call collective over this mesh torch.distributed.tensor._random.manual_seed(123+pp_rank, spmd_mesh) ``` In other words, if users want to call `torch.distributed.tensor._random.manual_seed`, they will be responsible for passing in the right value and DTensor won't perform any checks on it. If the current rank is not a part of the mesh, it will use the current device RNG state to initialize. 3. `OffsetBasedRNGTracker` still performs RNG state synchronization by broadcasting the RNG state on rank 0 to `WORLD`. However, calling `torch.distributed.tensor._random.manual_seed` is an exception. In this case, no broadcast will happen. 4. Enforce that the `manual_seed` call only accepts a "full mesh", i.e. the DTensor RNG state on every rank must be set through the call. This makes sure that no rank has its RNG state left uninitialized and the SPMD ranks have their RNG state synchronous. Motivation tl;dr 1. Lazily initializing the DTensor RNG tracker causes a hang in non-SPMD code such as Pipeline Parallel. 2. Users may want to set different seeds on ranks in one device mesh. 3. We want to keep the old behavior if users prefer not curating the RNG state and want to have DTensor take care of it. See details in https://github.com/pytorch/pytorch/issues/140301 Test `pytest test/distributed/_tensor/test_random_ops.py` `pytest test/distributed/tensor/parallel/test_tp_random_state.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141223 Approved by: https://github.com/wconstab ghstack dependencies: #141731, #141220	2024-11-29 07:59:34 +00:00
Xilun Wu	7f5bc9dd87	[dtensor][random][tp] remove the adhoc DTensor RNG tracker TensorParallelRNGTracker since it does not match FSDP2+TP (#141220 ) Summary The ad-hoc DTensor RNG tracker was used to mimic Megatron DDP+TP RNG behavior, but it turns out not to be compatible with PyTorch Distributed FSDP2+TP, so we decide to deprecate it and use `OffsetBasedRNGTracker` to replace it, which follows the SPMD semantics (replicas get the same random sampling result, shards get different results). Motivation `TensorParallelRNGTracker` was designed for DDP+TP where the random operators produce the same result along the data parallel mesh dimension and different results along the tensor parallel dimension. However, this does not apply to the new FSDP+TP composable combination where the model weights are sharded along the data parallel mesh dimension as well. Therefore we decide to remove this outdated RNG tracker type for now. If users have demands for an exact match between PyTorch Distributed and Megatron on random number generation results, feel free to file an issue. Impact `TensorParallelRNGTracker` was only used when Tensor Parallel is used (i.e. calling `parallelize_module`). For non-FSDP users, the "replicas get the same random numbers and shards get different ones" behavior remains unchanged. Unlike `TensorParallelRNGTracker`, which sets different seeds (`base_seed + 2718 + TP_rank`) within the TP group, DTensor now sets the same seed (default value is 1234 but users can call `torch.distributed.tensor._random.manual_seed` to modify) on all ranks but chooses the right RNG offset based on DTensor placements to enforce the "replicas get the same random numbers and shards get different ones" invariant. For FSDP2 users, improvement should be observed in that a DTensor sharded within the DP group now gets different random number sampling, which `TensorParallelRNGTracker` failed to do, though we're not sure how much this change will improve the eventual training loss convergence. Test 1-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_tp_model_meta_init` 2-d model weight meta init: `pytest test/distributed/_tensor/test_random_ops.py -s -k test_fsdp_tp_model_meta_init` TP model weight init test: `pytest test/distributed/tensor/parallel/test_tp_random_state.py` FSDP+TP model weight init test: `pytest test/distributed/_composable/fsdp/test_fully_shard_init.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141220 Approved by: https://github.com/wconstab ghstack dependencies: #141731	2024-11-29 07:59:26 +00:00
Xilun Wu	c55191f3a2	[dtensor][random] add 1d and 2d model meta init tests (#141731 ) Summary Added tests for model meta init on 1-d mesh (TP) and 2-d mesh (FSDP+TP). This exploits the issue where DTensor RNG failed to initialize weights differently across FSDP ranks. Test `pytest test/distributed/_tensor/test_random_ops.py -s -k meta_init` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141731 Approved by: https://github.com/wconstab	2024-11-29 07:59:20 +00:00
Bob Ren	1a32daeb17	automatic dynamic unspecialize float (#141647 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/141647 Approved by: https://github.com/ezyang	2024-11-29 07:53:53 +00:00
Xia, Weiwen	9827d677b4	[Quant][PT2E][X86] annotate and convert for linear_dynamic_fp16 (#141480 ) Annotate linear node for `linear_dynamic_fp16` with `X86InductorQuantizer` After `convert_pt2e`, the pattern will be ``` x \| linear <- to_fp32 <- to_fp16 <- w ``` Test plan ``` pytest test/quantization/pt2e/test_x86inductor_quantizer.py -k test_linear_dynamic_fp16 ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141480 Approved by: https://github.com/jgong5, https://github.com/jerryzh168	2024-11-29 07:48:39 +00:00
Yang Wang	b7a45dbae3	Add monitor script (#141438 ) # Overview Add monitor script to collect system-level utilization data during CI tests. Currently all monitoring scripts are disabled. # Details - Add flag to customize the time intervals for logging - Enable multiple GPU utilization logging # Next step Enable the monitor script in non-perf-test workflows Pull Request resolved: https://github.com/pytorch/pytorch/pull/141438 Approved by: https://github.com/huydhn	2024-11-29 04:14:31 +00:00
Roy Hvaara	4d5c096a55	[MPS] Add autocast rule for SDPA (#141776 ) Fixes #141774 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141776 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2024-11-29 03:34:03 +00:00
Edward Z. Yang	b97a786125	Inline compile_to_fn at its only call site (#141691 ) Stacked on https://github.com/pytorch/pytorch/pull/141689 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141691 Approved by: https://github.com/jansel ghstack dependencies: #141681, #141683, #141685, #141688, #141689	2024-11-29 01:15:38 +00:00
Edward Z. Yang	9e4723cc6e	Unify post_compile1 and CompiledFxGraph constructor (#141689 ) Stacked on https://github.com/pytorch/pytorch/pull/141688 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141689 Approved by: https://github.com/jansel ghstack dependencies: #141681, #141683, #141685, #141688	2024-11-29 01:15:38 +00:00
Edward Z. Yang	29326b9d29	Hoist post_compile1 into fx_codegen_and_compile (#141688 ) Stacked on top of https://github.com/pytorch/pytorch/pull/141685 Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141688 Approved by: https://github.com/Skylion007, https://github.com/jansel ghstack dependencies: #141681, #141683, #141685	2024-11-29 01:15:31 +00:00
Edward Z. Yang	cf3daf723f	Unify cache disable and cache bypass paths (#141685 ) I was constantly annoyed at the fact that we had a separate else branch for when cache was disabled which was distinct from when cache was bypassed. This diff gets rid of the disabled cache branch, so we use the same logic for bypass/disable. I actually think this change probably didn't actually matter much for the POC but I think it's cleaner. Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141685 Approved by: https://github.com/aorenste ghstack dependencies: #141681, #141683	2024-11-29 01:15:24 +00:00
Aaron Gokaslan	7224cd4471	[BE]: Update 12.6 builds to CUDA 12.6.3 (#141433 ) Update CUDA 12.6 to Update 3 and make cusparse-lt 0.6.3? #141365 Was going to leave some comments on #141365, but thought it was just faster to open a PR here. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141433 Approved by: https://github.com/atalman	2024-11-28 22:01:47 +00:00
Richard Barnes	ae6519cb74	[codemod] c10::string_view -> std::string_view in fields (#141736 ) Summary: `c10::string_view` is being removed, so we need to migrate. Test Plan: Sandcastle Reviewed By: palmje Differential Revision: D65830276 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141736 Approved by: https://github.com/Skylion007	2024-11-28 21:35:53 +00:00
Ivan Zaitsev	09a3eddc07	Revert #141066 and #141494 (#141721 ) manual revert due to merge conflicts note: #141494 was reverted out of order blocking automatic revert of #141066 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141721 Approved by: https://github.com/avikchaudhuri	2024-11-28 20:18:19 +00:00
PyTorch MergeBot	d08bd6d627	Revert "Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587 )" This reverts commit `8a3317cd41`. Reverted https://github.com/pytorch/pytorch/pull/141587 on behalf of https://github.com/atalman due to inductor/test_torchinductor_strided_blocks.py::TritonBlockPointerTestGPU::test_expand_broadcast_x_size0_y_size0_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/12072823884/job/33669367764) [HUD commit link](`8a3317cd41`) ([comment](https://github.com/pytorch/pytorch/pull/141587#issuecomment-2506690095))	2024-11-28 19:41:03 +00:00
Pruthvi Madugundu	907c31f529	[ROCm] devtoolset / GCC11 upgrade on manylinux images - 1b of 2 (docker images) (#141609 ) Upgrade gcc version from 9 to 11 on ROCm manylinux images. Needed for #141423 since almalinux8-based manylinux2_28 images for ROCm (#140681) installs gcc-toolset-9, which installs [gcc 9.2.1](https://pkgs.org/download/gcc-toolset-9-gcc-c++). However, PyTorch CMakeLists.txt enforces a [minimum gcc version of 9.3](`5318bf8baf/CMakeLists.txt (L61)`). Pull Request resolved: https://github.com/pytorch/pytorch/pull/141609 Approved by: https://github.com/jeffdaily Co-authored-by: Jithun Nair <jithun.nair@amd.com>	2024-11-28 19:18:09 +00:00
Bludator	f4187050fe	[ONNX] Remove special handling of torchvision.ops imports in onnx export (#141569 ) Fixes #141568 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141569 Approved by: https://github.com/titaiwangms Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com> Co-authored-by: Ti-Tai Wang <titaiwang@microsoft.com>	2024-11-28 18:05:40 +00:00
Edward Z. Yang	6d204cb5ed	Hoist set_feature_use out of conditional, rename some variables (#141683 ) Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141683 Approved by: https://github.com/jamesjwu, https://github.com/jansel ghstack dependencies: #141681	2024-11-28 17:43:11 +00:00
Edward Z. Yang	229daf7470	Inline FxGraphCache.load into its sole call site (#141681 ) I need to restructure the body of FxGraphCache.load with the outer if-else in its call site, so inline it goes! Signed-off-by: Edward Z. Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/141681 Approved by: https://github.com/jamesjwu, https://github.com/jansel	2024-11-28 17:43:11 +00:00
chuanqiw	b9a8df4bdd	[CD] Add triton xpu build back (#141775 ) The Triton xpu build was temporarily stopped by https://github.com/pytorch/pytorch/pull/139206 to wait for the triton xpu upgrade PR https://github.com/pytorch/pytorch/pull/137886 to land. Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141775 Approved by: https://github.com/atalman	2024-11-28 17:37:42 +00:00
cyy	6b430c26bd	Fix bugprone-argument-comment (#141777 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141777 Approved by: https://github.com/Skylion007	2024-11-28 16:56:50 +00:00
Mwiza Kunda	8a3317cd41	Refactor test_torchinductor_strided_blocks to also support triton CPU (#141587 ) This increases test coverage for triton CPU from just test_torchinductor.py to also testing block pointer lowering. Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/141587 Approved by: https://github.com/jansel	2024-11-28 16:45:25 +00:00
Nan Zhang	5aacfa037b	[Inductor] fix broadcast logic for Triton (#141027 ) (#141693 ) Summary: Fix logic for inserting broadcast on kernel with load going directly to store. In the case where load is going directly to store, we insert a tl.broadcast on the store, regardless of the block size on the load. In the case where a broadcast is not required, the downstream Triton compiler is expected to remove this no-op broadcast instruction. Test Plan: Added tests under test_torchinductor_strided_blocks.py:test_expand_broadcast in OSS and internal test cases. Reviewed By: blaine-rister Differential Revision: D65518033 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141693 Approved by: https://github.com/blaine-rister	2024-11-28 16:38:25 +00:00
Laith Sakka	f684dbd002	Try to simplify FloorDiv axioms implications when needed during evaluations. (#141267 ) Summary: This is very much the same solution proposed by bobrenjc93, except that it is restricted to expressions and axioms that have FloorDiv, since those are the only ones that could have become CleanDiv and the only ones that can change as the shape env changes. This also does not break the torchrec benchmarks. It might be worth knowing why the generalization of this does break the torchrec benchmarks, but we could just be hitting another bug or NYI situation. Overhead? None on ``` buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:pt2_compile_benchmark -- --num-features=1000 ``` Differential Revision: D66307433 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141267 Approved by: https://github.com/ezyang	2024-11-28 15:35:35 +00:00
chuanqiw	d49f0bf466	[CI] Fix xpu linux ci build environment duplicated issue (#141546 ) We found that there are duplicated build environments in the XPU Linux CI test, which could cause test jobs to download the wrong PyTorch build artifact. See https://github.com/pytorch/pytorch/actions/runs/12023238798/job/33518351906#step:14:633 Works for https://github.com/pytorch/pytorch/issues/139722 and https://github.com/pytorch/pytorch/issues/114850 Pull Request resolved: https://github.com/pytorch/pytorch/pull/141546 Approved by: https://github.com/EikanWang, https://github.com/atalman	2024-11-28 14:21:21 +00:00
atalman	0f261e8f77	Add Manylinux2014 and Manylinux 2.28 config to triton builds. Run auditwheel on triton binaries (#141704 ) This PR combines Manylinux 2_28 and Manylinux 2014 builds of triton under one workflow. This is required in order to support torch cpu, cuda 118, cuda 12.4 wheels built with Manylinux 2014 and torch cuda 12.6 wheels built with Manylinux 2_28. Manylinux 2014 wheels: ``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl`` Manylinux 2_28 wheels: ``pytorch_triton-3.2.0+git35c6c7c6-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/141704 Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/huydhn	2024-11-28 13:40:39 +00:00
eellison	f83361b274	inductor dtype propagation fixes (#141495 ) - Add in upcast_compute_type on creation of new tensors (loads, constants) - Fixes index_expr - right now we are sort of inconsistent in dtype and don't always respect the dtype specified. It would be nice to fix, but that is not done in this PR. - Bug fix in view dtype where we were always upcasting back to fp32 when input was in bf16/fp16. We should only be doing that if the output is also in bf16/fp16. - For masked, avoid calling dtype propagation and just use output dtype. Turns on the runtime dtype verification for opinfo tests. The separate test file is still useful because we can use it for testing turning off codegen_upcast_to_fp32. Follow ups: - We could consider requiring less explicit upcast_compute_types calls and do it automatically. That would potentially make things easier but be less flexible in the future. Maybe I should have done it in this PR. - Be more consistent on our index expr dtype printing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141495 Approved by: https://github.com/blaine-rister, https://github.com/arui-meta, https://github.com/ezyang ghstack dependencies: #139945, #140057	2024-11-28 11:39:38 +00:00
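
Two of the Inductor refactors listed above (5deca07c0d and 49fde426ba) describe changing `self.numels` from a tuple like `(8,16)` to a dict like `{"x": 8, "r": 16}`, and replacing `tree.prefix == "r"` checks with a helper that can also handle prefixes like `("r0_", "r1_")`. The following is only a rough pure-Python sketch of that idea. The names `numels_to_dict` and `is_reduction_prefix` are hypothetical and are not Inductor's actual API, and the `startswith("r")` rule is an assumption made for illustration.

```python
from typing import Dict, Sequence, Tuple


def numels_to_dict(numels: Tuple[int, ...], prefixes: Sequence[str]) -> Dict[str, int]:
    """Pair each positional numel with its prefix, e.g. (8, 16) -> {"x": 8, "r": 16}."""
    if len(numels) != len(prefixes):
        raise ValueError("numels and prefixes must have the same length")
    return dict(zip(prefixes, numels))


def is_reduction_prefix(prefix: str) -> bool:
    """Hypothetical helper: treat "r" and names such as "r0_" or "r1_" as reductions."""
    return prefix.startswith("r")


# Tuple form from the commit message, keyed by assumed prefixes "x" and "r".
tiling = numels_to_dict((8, 16), ("x", "r"))

# With a dict, reduction and pointwise dimensions can be split by name
# instead of by position in a tuple.
reduction_numels = {p: n for p, n in tiling.items() if is_reduction_prefix(p)}
pointwise_numels = {p: n for p, n in tiling.items() if not is_reduction_prefix(p)}
```

Keying by prefix means code that needs the reduction size no longer depends on the reduction being the last tuple element, which is presumably what makes the multi-dimensional case easier to express.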


81589 commits
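
Commit ce572fedfc above notes that a seed such as `2**64-1` falls outside the `torch.int64` range and that the fix is to use `dtype=torch.uint64`. The arithmetic behind that can be checked without PyTorch. This is a minimal sketch using plain Python integers; `fits_int64` and `fits_uint64` are hypothetical helper names introduced here, not part of any library.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def fits_int64(value: int) -> bool:
    """True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


def fits_uint64(value: int) -> bool:
    """True if value is representable as an unsigned 64-bit integer."""
    return 0 <= value <= UINT64_MAX


# The seed value used as the failing example in the commit message.
seed = 2**64 - 1
seed_fits_signed = fits_int64(seed)     # outside the signed 64-bit range
seed_fits_unsigned = fits_uint64(seed)  # inside the unsigned 64-bit range
```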