Fixes #142058
## Summary
DTensor's `convolution_backward` op throws an exception when the input tensor has `requires_grad=False`, which happens when the conv layer is the first layer in the model.
The ATen `convolution_backward` op usually returns 3 tensors (`grad_input`, `grad_weight`, `grad_bias`), but `grad_input` is actually an `Optional[Tensor]` that can be `None` in the case mentioned above.
However, the DTensor sharding propagation rule and the corresponding TP conv backward implementation both assume that `grad_input` is always present.
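A minimal sketch of the failure mode on plain tensors (no DTensor setup needed to see where the `None` comes from): when the conv input does not require grad, autograd sets `output_mask[0]` to `False` and `aten::convolution_backward` produces no `grad_input`.
```python
import torch

# Sketch: a first-layer conv whose input has requires_grad=False.
# Autograd skips grad_input (output_mask[0] is False), so
# convolution_backward returns None for it; DTensor previously
# assumed it always exists.
x = torch.randn(1, 3, 8, 8)                      # requires_grad=False
w = torch.randn(4, 3, 3, 3, requires_grad=True)
out = torch.nn.functional.conv2d(x, w)
out.sum().backward()
assert x.grad is None and w.grad is not None
```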
## Fix
Allow `grad_input` to be `None` for the `convolution_backward` op.
## Test
`pytest test/distributed/tensor/test_convolution_ops.py`
## Follow-up
The current DTensor conv op implementation also ignores `output_mask`; this may need further care.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142278
Approved by: https://github.com/bdhirsh
Use a comm-optimized memory pool for DDP applications so that NVLink SHARP comes with zero-copy on H100+ platforms.
This means less SM usage and less memory contention between NCCL kernels and compute kernels.
Added env `DDP_DISABLE_COMM_MEM` as a back-out option:
```
An environment variable to disable comm-optimized memory pool.
Default is 0, which means comm-optimized memory pool is enabled.
Users can set it to 1 in case of seeing regression or OOM (because this
comm MemPool may not share space with regular compute MemPool).
```
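A hedged sketch of the back-out path (assuming the variable is read at process startup, so it should be set before distributed initialization):
```python
import os

# Assumption: must be set before torch.distributed initialization so DDP
# sees it; "1" falls back to the regular compute MemPool.
os.environ["DDP_DISABLE_COMM_MEM"] = "1"

import torch.distributed as dist
# dist.init_process_group(...) and DDP setup proceed as usual.
```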
Differential Revision: [D69297766](https://our.internmc.facebook.com/intern/diff/D69297766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146589
Approved by: https://github.com/syed-ahmed, https://github.com/c-p-i-o, https://github.com/fduwjj
Since the functional autograd + compiled autograd migration, we no longer trace into nodes, and everything is lifted. We can't support this flag, which tries to inline in make_fx style during the initial compiled autograd pass. There is no more usage internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146720
Approved by: https://github.com/zou3519
By appending `-frecord-sources -gline-tables-only` to the compilation command
Helpful when debugging shaders compiled into libtorch
Test plan: Run
`python ../tools/build_with_debinfo.py ../aten/src/ATen/native/mps/kernels/UpSample.metal ../aten/src/ATen/native/mps/operations/UpSample.mm`
And then run the following to capture a shader and check that it contains debug info:
```python
import torch
import os

os.environ["MTL_CAPTURE_ENABLED"] = "1"
inp = torch.rand(size=(6, 3, 10, 20), device="mps", dtype=torch.float32)
with torch.mps.profiler.metal_capture("bilinear2d"):
    out = torch.nn.functional.interpolate(inp, scale_factor=(1.7, 0.9), mode="bilinear")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146768
Approved by: https://github.com/dcci
Summary:
X-link: https://github.com/pytorch/executorch/pull/7040
Accomplished by importing relevant files from c10 into
executorch/runtime/core/portable_type/c10, and then using `using` in
the top-level ExecuTorch headers. This approach should keep the
ExecuTorch build hermetic for embedded use cases. In the future, we
should add a CI job to ensure the c10 files stay identical to the
PyTorch ones.
ghstack-source-id: 260047850
exported-using-ghexport
Test Plan: builds
Differential Revision: D66106969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111
Approved by: https://github.com/malfet
Adds an `invoke_quant` higher-order operator, as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0).
The primary motivations are:
- Unifying scattered reasoning for quant operators throughout the code base
- Ease of pattern matching - see this very large pattern-match expression [here](949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)), compared to the pattern I have in the tests:
```
@register_graph_pattern(
    CallFunction(
        torch.ops.aten.mm,
        CallFunction(
            torch.ops.higher_order.invoke_quant,
            Ignored(),
            Ignored(),
            Ignored(),
            scheme="nf4",
        ),
        Arg(),
    ),
    pass_dict=test_pass,
)
```
- Ability to specify inductor-specific logic, like codegen'ing the operators in lower precision or forcing fusion into a matmul.
Example graph:
``` Python
===== AFTER POST GRAD =====
/data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module):
    def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
        # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
        repeated_subgraph0 = self.repeated_subgraph0
        invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4');  repeated_subgraph0 = arg0_1 = arg1_1 = None
        return (invoke_quant,)

    class repeated_subgraph0(torch.nn.Module):
        def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"):
            # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self)  # type: ignore[call-arg]
            mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1);  arg0_1 = None
            add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1);  mul = arg1_1 = None
            return add
```
The schema for `invoke_quant` is `torch.ops.higher_order.invoke_quant(subgraph, *args, scheme=None)` where the scheme will not always be present.
I wasn't sure exactly how inductor-specific configurations like `codegen_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want them to affect pattern matching, so they are stored as meta on the node itself. Following that, I wanted the invocation of the HOP to match how it shows up in the graph, so I decided to make it an object that is then invoked for the tracing.
```
invoke_quant = InvokeQuant(codegen_low_precision=True)
invoke_quant(gn, (x, y), scheme="nf4")
```
Todo: stop requiring the packing of args in a tuple; will be done following https://github.com/pytorch/pytorch/pull/139162.
Feedback welcome.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102
Approved by: https://github.com/Chillee
# Feature
Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell whether it can be represented as a block pointer, that analysis should be invariant to `Identity`s.
This PR adds a few features to achieve this invariance.
- Create a new expansion mode, `expr.expand(identity=True)`, which removes all `Identity` functions from the expression (a short sketch follows this list).
- Preprocess the expression with this expansion prior to pattern matching.
- Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`.
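A minimal sketch of the new mode; the import path for `Identity` is an assumption drawn from `torch/utils/_sympy/functions.py`, not from this PR's text:
```python
import sympy
from torch.utils._sympy.functions import Identity  # assumed import path

x, y = sympy.symbols("x y")
expr = Identity(x + Identity(2 * y)) + y

# expand(identity=True) strips every Identity wrapper, so downstream
# pattern matching (e.g. block-pointer analysis) sees the plain expression.
print(expr.expand(identity=True))  # expected: x + 3*y, no Identity left
```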
# Test plan
This PR adds a few new unit tests:
- Added a unit test specifically for `expr.expand(identity=True)`.
- Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops.
I originally intended to add an end-to-end test compiling pointwise cat and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000
Approved by: https://github.com/eellison, https://github.com/jansel
This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2.
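An illustrative sketch of the pattern change (names and methods hypothetical; the real `DefaultHandler` is defined in part 2 of this stack):
```python
# Before: every op lookup went through dynamic __getattr__ dispatch,
# which is slow and opaque to static analysis.
class OpHandlerWithGetattr:
    def __getattr__(self, name):
        def handler(*args, **kwargs):
            return self._default(name, args, kwargs)
        return handler

# After: a DefaultHandler-style base class with explicit methods that all
# funnel into one default implementation (hypothetical simplification).
class DefaultHandler:
    def _default(self, name, args, kwargs):
        raise NotImplementedError

    def add(self, *args, **kwargs):
        return self._default("add", args, kwargs)

    def mul(self, *args, **kwargs):
        return self._default("mul", args, kwargs)
```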
Some compile time wins from this as well:
```
2025-02-02T19:46:32.2033010Z
2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2037575Z
2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones
2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50%
2025-02-02T19:46:32.2040131Z
2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results.
2025-02-02T19:46:32.2042188Z
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255
Approved by: https://github.com/shunting314
ghstack dependencies: #146252, #146254
Adjust and add deprecation messages to torch.onnx utilities and verification methods, because they relate only to TorchScript and are obsolete.
Removed the unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions `deprecated` decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639
Approved by: https://github.com/titaiwangms
# Summary
Fixes https://github.com/pytorch/pytorch/issues/146377
So what was the original problem: we were codegening a really weird epilogue:
```Python
# first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM]
# then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM]
xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2
tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask)
x5 = (xindex % ks3)
tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last')
tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask)
```
This epilogue was writing and then reading from overlapping regions of memory causing a race condition.
### Why were we generating this epilogue
During lowering we created a buffer with a different size/stride from the expected return strides. I think this added an implicit node (for permuting this wrongly strided output to the one expected by the meta func). The scheduler for some reason thought it was okay to fuse this into the epilogue; tbh I don't know why.
This fixes the broken meta func and the original repro. I will add a test, but it is hard to trigger; better than nothing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563
Approved by: https://github.com/Chillee
Summary:
This commit fixes a crash in the gemm template lowering caused by hitting an [assert](fd515e4f59/torch/_inductor/codegen/common.py (L1181)) that a buffer was previously removed.
The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set.
The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings.
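A hedged sketch of the naming fix (helper name and counter are hypothetical; the real change lives in the gemm template lowering):
```python
import itertools

_gemm_suffix = itertools.count()

def unique_acc_buffer_name(base: str) -> str:
    # Each gemm lowering gets its own accumulation buffer name, so a name
    # recorded in removed_buffers by one lowering can't collide with the
    # buffer used by the next lowering.
    return f"{base}_gemm_acc_{next(_gemm_suffix)}"

print(unique_acc_buffer_name("buf0"))  # buf0_gemm_acc_0
print(unique_acc_buffer_name("buf0"))  # buf0_gemm_acc_1
```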
Differential Revision: D68814625
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353
Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475
Implements the `lu_unpack` function on MPS. No new tests were added because coverage comes from removing `lu_unpack` from the `UNIMPLEMENTED_XFAILLIST` in test_mps, which exercises it via the `test_output_match` function.
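A quick usage sketch of the newly supported op (assumes an MPS-capable machine; factoring is done on CPU to keep the sketch conservative about what else is available on MPS):
```python
import torch

A = torch.randn(3, 3)
LU, pivots = torch.linalg.lu_factor(A)
# lu_unpack now runs natively on MPS per this PR.
P, L, U = torch.lu_unpack(LU.to("mps"), pivots.to("mps"))
print(torch.allclose((P @ L @ U).cpu(), A, atol=1e-5))  # reconstruction check
```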
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework.
This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831
Approved by: https://github.com/metascroy
Summary:
Public summary (shared with GitHub): this diff implements the correct version of the PyTorch API `max_memory_allocated`.
Nit: the file previously contained two unit tests with the same name (due to a wrong revert); I deleted the deprecated one to revamp the correct version.
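A hedged usage sketch (assumes an available MTIA device and that `torch.mtia` mirrors the `torch.cuda`-style memory-stats signature):
```python
import torch

# Assumption: torch.mtia follows the torch.cuda memory-stats interface.
device = torch.device("mtia", 0)
x = torch.empty(1024, 1024, device=device)      # allocate ~4 MB
peak = torch.mtia.max_memory_allocated(device)  # peak bytes allocated
print(peak)
```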
Test Plan:
```
buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated
```
https://www.internalfb.com/intern/testinfra/testrun/12103424065182810
Reviewed By: yuhc
Differential Revision: D68988435
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659
Approved by: https://github.com/nautsimon
Summary: D68801098 introduced a function signature mismatch for `printNcclCommProxyTrace`. Revert it so that the trunk build can pass.
Test Plan:
With the change, the build of an APS model using rcclexp now passes:
`sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0`
Reviewed By: c-p-i-o
Differential Revision: D69149588
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453
Approved by: https://github.com/c-p-i-o
Summary:
Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the `is_symbol` check was not enough here).
This PR fixes this logic and, moreover, moves it into the serializer, because we don't need this adjustment outside serde.
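An illustrative sketch of why `is_symbol` is insufficient (the `u`-prefix convention for unbacked symints is an assumption drawn from PyTorch's symbolic-shapes naming, not from this PR's text):
```python
import sympy

u0 = sympy.Symbol("u0")  # stands in for an unbacked symint
s0 = sympy.Symbol("s0")  # stands in for a backed symint

expr = u0 + 1
substituted = expr.subs(u0, s0)  # a backed symbol slips into the binding

free = next(iter(substituted.free_symbols))
print(free.is_symbol)             # True for backed and unbacked alike
print(free.name.startswith("u"))  # False: the stricter check needed here
```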
Test Plan: added test
D68880766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115
Approved by: https://github.com/pianpwk
Summary:
https://github.com/pytorch/pytorch/pull/145815 used caching for the `treespec_loads` calculation to speed up AOTI module calls.
However, this made tests flaky when comparing `TreeSpec`s for objects in local scope, e.g. `test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner`:
type comparison yields False when the local scopes differ, due to `lru_cache`.
Since this comparison is only used for testing purposes, we now only test whether the `str(type)` values are equal.
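A small standalone sketch of the underlying Python behavior (not this PR's code): classes defined in a local scope are distinct type objects on each call, but their string forms match.
```python
def make():
    class Inner:  # new type object on every call
        pass
    return Inner

A, B = make(), make()
print(A == B)            # False: distinct types despite identical definitions
print(str(A) == str(B))  # True: both are "<class '...make.<locals>.Inner'>"
```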
Test Plan:
```
PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py
```
Differential Revision: D69137706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442
Approved by: https://github.com/angelayi
# Fixes
https://github.com/pytorch/pytorch/issues/146624
### Updated
From offline discussion, going with `size_hint`.
However, this does incur guards. I couldn't really think of a fancy way to avoid it. I was going to do `V.graph.sizevars.size_hint` with some default for num blocks, but we ultimately need some information about the input.
I am also not sure if `size_hint` is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe).
For instance, in the repro, we quickly hit the recompile limit.
```Shell
torch._dynamo hit config.recompile_limit (8)
function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161)
last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546
To log all recompilation reasons, use TORCH_LOGS="recompiles".
To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657
Approved by: https://github.com/Chillee, https://github.com/yanboliang
This PR does a few things:
* Set fall-back-to-aten to False for most tests. Without this, a lot of tests would fail silently, since they would just use aten.
* Disable two subprocess-related broken tests. They crash in a subprocess; more investigation is needed.
* Remove/disable the tests on A100. Let me elaborate a bit more.
There are two types of A100 tests:
* Normal tests that also cover A100, e.g. mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore: GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out, since they are of `GemmKind.Universal` while only `GemmKind.Universal3x` is supported in the 3x template.
* Tests for A100 only. The mixed mm and sparse semi-structured tests have been failing for a while with "TypeError: can't multiply sequence by non-int of type 'str'". Disabled them for now. Do let us know if you care about them @alexsamardzic
Differential Revision: D69209929
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554
Approved by: https://github.com/chenyang78