pytorch

mirror of https://github.com/saymrwulf/pytorch.git synced 2026-05-14 20:57:59 +00:00

Author	SHA1	Message	Date
Scott Wolchok	ade8fee512	Use c10 version of half/bfloat16 in executorch (#144111 ) Summary: X-link: https://github.com/pytorch/executorch/pull/7040 Accomplished by importing relevant files from c10 into executorch/runtime/core/portable_type/c10, and then using `using` in the top-level ExecuTorch headers. This approach should keep the ExecuTorch build hermetic for embedded use cases. In the future, we should add a CI job to ensure the c10 files stay identical to the PyTorch ones. ghstack-source-id: 260047850 exported-using-ghexport Test Plan: builds Differential Revision: D66106969 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144111 Approved by: https://github.com/malfet	2025-02-08 22:40:14 +00:00
eellison	92b7e610ab	[Inductor changes] Invoke Quant (#139102 ) Adds an `invoke_quant` higher order operator as proposed [here](https://docs.google.com/document/d/1s2PfJlq6Q1F8l11CkTIC69BW1rEnGEgs6YmBC7hu8rA/edit?tab=t.0). The primary motivations are - Unifying scattered reasoning for quant operators throughout the code base - Ease of pattern matching - see this very large pattern match expression [here](`949fdd2997/torch/_inductor/fx_passes/post_grad.py (L390-L426)`. Compared to the pattern I have in the tests: ``` @register_graph_pattern( CallFunction( torch.ops.aten.mm, CallFunction( torch.ops.higher_order.invoke_quant, Ignored(), Ignored(), Ignored(), scheme="nf4", ), Arg(), ), pass_dict=test_pass, ) ``` - Ability to specify inductor specific logic, like codegen'ing the operators in lower precision, or forcing fusion to a matmul. Example graph: ``` Python ===== AFTER POST GRAD ===== /data/users/eellison/pytorch/torch/fx/_lazy_graph_module.py class <lambda>(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self) # type: ignore[call-arg] repeated_subgraph0 = self.repeated_subgraph0 invoke_quant: "f32[8][1]cpu" = torch.ops.higher_order.invoke_quant(repeated_subgraph0, arg0_1, arg1_1, scheme = 'nf4'); repeated_subgraph0 = arg0_1 = arg1_1 = None return (invoke_quant,) class repeated_subgraph0(torch.nn.Module): def forward(self, arg0_1: "f32[8][1]cpu", arg1_1: "f32[8][1]cpu"): # File: /data/users/eellison/pytorch/torch/_higher_order_ops/invoke_quant.py:87 in __call__, code: return invoke_quant_tracer(*args, **kwargs, quant_options=self) # type: ignore[call-arg] mul: "f32[8][1]cpu" = torch.ops.aten.mul.Tensor(arg0_1, arg1_1); arg0_1 = None add: "f32[8][1]cpu" = torch.ops.aten.add.Tensor(mul, arg1_1); mul = arg1_1 = None return add ``` The schema for `invoke_quant` is 
`torch.ops.higher_order.invoke_quant(subgraph, args, scheme=None)` where the scheme will not always be present. I wasn't sure exactly how the inductor specific configurations like `codgen_in_low_precision` should be passed through. I didn't want to stuff them all in as kwargs, and I didn't want to have them affect pattern matching. So they will be stored as meta of the node itself. And, following that, I wanted the invocation of the hop to match how it will show up in the graph. So I decided to have it be an object that is then invoked for the tracing. ``` invoke_quant = InvokeQuant(codegen_low_precision=True) invoke_quant(gn, (x, y), scheme="nf4") ``` Todo: do not require packing the args in a tuple; will do following https://github.com/pytorch/pytorch/pull/139162. Feedback welcome. Pull Request resolved: https://github.com/pytorch/pytorch/pull/139102 Approved by: https://github.com/Chillee	2025-02-08 19:30:19 +00:00
Blaine Burton Rister	a1bfb39a31	[Inductor] Expand Identity ops prior to block pattern matching (#146000 ) # Feature Inductor sometimes uses `Identity` functions to group various terms of an expression. While this is convenient in some scenarios, it can frustrate pattern matching. For example, when we're matching an indexing expression to tell if it can be represented as a block pointer, that analysis should be invariant to `Identity`'s. This PR adds a few features to achieve this invariance. - Create a new expansion mode `expr.expand(identity=True)`, which removes all `Identity` functions from the expression. - Preprocess the expression with this expansion prior to pattern matching. - Bonus: create a new test utility function called `dummy_graph()`, which creates a simple `GraphLowering`. This is useful for testing the pattern matcher, as we need to initialize `V.graph` before we can access `V.graph.sizevars`. # Test plan This PR adds a few new unit tests: - Added a unit test specifically for `expr.expand(identity=True)`. - Added a new unit test module for the block pattern matcher. Tested that we can correctly match some example patterns containing Identity ops. I originally intended to add an end to end test compiling pointwise cat, and mapping the corresponding memory accesses to block pointers. However, it looks like that will take more work, since the [relevant code path](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codegen/triton.py#L1306) disables block pointer analysis. It might be better to defer that to a future PR. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146000 Approved by: https://github.com/eellison, https://github.com/jansel	2025-02-08 18:11:53 +00:00
Jason Ansel	eee5622b98	[inductor] Pre-populate cache for simplify_with_ranges return value (#146373 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146373 Approved by: https://github.com/yanboliang, https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282, #146297	2025-02-08 18:00:49 +00:00
Jason Ansel	c098385cb3	[inductor] Refactor CaptureIndexing into global scope (#146297 ) And inline SimplifyIndexing into CaptureIndexing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146297 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257, #146282	2025-02-08 18:00:49 +00:00
Jason Ansel	d35f6b2339	[inductor] Minor compile time optimizations in DefaultHandler (#146282 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146282 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255, #146257	2025-02-08 18:00:40 +00:00
Jason Ansel	06604c4ec1	[inductor] Refactor op handlers part 5 (#146257 ) This makes OpHandler just a normal class using inheritance, and removes typing workarounds needed because it wasn't one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146257 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254, #146255	2025-02-08 18:00:30 +00:00
Jason Ansel	403db2faee	[inductor] Refactor op handlers part 4 (#146255 ) This replaces the `__getattr__()` pattern used in remaining OpHandlers with a `DefaultHandler` class defined in part 2. Some compile time wins from this as well: ``` 2025-02-02T19:46:32.2033010Z 2025-02-02T19:46:32.2036607Z WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 29633182927 is -1.71% lower than expected 30150000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2037575Z 2025-02-02T19:46:32.2037907Z please update all results that changed significantly, and not only the failed ones 2025-02-02T19:46:32.2039291Z PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 43986879172 -1.02% is within expected 44440000000 ±2.50% 2025-02-02T19:46:32.2040131Z 2025-02-02T19:46:32.2041180Z WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26246225695 is -1.85% lower than expected 26740000000 ±1.50% please update the expected results. 2025-02-02T19:46:32.2042188Z ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146255 Approved by: https://github.com/shunting314 ghstack dependencies: #146252, #146254	2025-02-08 18:00:17 +00:00
Jason Ansel	0e31e5932b	[inductor] Refactor op handlers part 3 (#146254 ) Fixes type errors that arise from typing `V.ops`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146254 Approved by: https://github.com/shunting314 ghstack dependencies: #146252	2025-02-08 18:00:08 +00:00
Jason Ansel	71498aeae3	[inductor] Refactor op handlers part 2 (#146252 ) This replaces the `__getattr__()` pattern used in (some) OpHandlers with a `DefaultHandler` class that has an implementation of every op that calls `self._default()`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146252 Approved by: https://github.com/yanboliang	2025-02-08 18:00:00 +00:00
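The `__getattr__()`-to-`DefaultHandler` change described in the row above can be illustrated with a minimal sketch. The classes and op names here (`GetattrHandler`, `DefaultHandlerSketch`, `CountingHandler`, `add`, `mul`) are hypothetical stand-ins, not Inductor's actual code; only the idea that every op is an explicit method calling `self._default()` comes from the commit message.

```python
from typing import Any


class GetattrHandler:
    """Old style (illustrative): any unknown attribute becomes an op at call time."""

    def __getattr__(self, name: str):
        def inner(*args: Any, **kwargs: Any) -> Any:
            return self._default(name, args, kwargs)

        return inner

    def _default(self, name: str, args: tuple, kwargs: dict) -> Any:
        return (name, args)


class DefaultHandlerSketch:
    """New style (illustrative): each op is a real method forwarding to _default()."""

    def _default(self, name: str, args: tuple, kwargs: dict) -> Any:
        raise NotImplementedError

    # Hypothetical ops; a real handler would list every supported op.
    def add(self, a: Any, b: Any) -> Any:
        return self._default("add", (a, b), {})

    def mul(self, a: Any, b: Any) -> Any:
        return self._default("mul", (a, b), {})


class CountingHandler(DefaultHandlerSketch):
    """Subclass that only overrides _default() to count op usage."""

    def __init__(self) -> None:
        self.counts: dict = {}

    def _default(self, name: str, args: tuple, kwargs: dict) -> Any:
        self.counts[name] = self.counts.get(name, 0) + 1
        return (name, args)


handler = CountingHandler()
result = handler.add(1, 2)
handler.mul(3, 4)
handler.add(5, 6)
```

With explicit methods, type checkers and `hasattr` can see the ops, and a call does not go through the `__getattr__` fallback and a fresh closure each time, which is one plausible source of the compile-time wins reported in part 4.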
cyyever	46e83bb637	Fix linter F821 error (#146665 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146665 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-02-08 07:19:37 +00:00
Natalia Gimelshein	a3ca5c7f4e	remove incorrect warnings from min/max documentation (#146725 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146725 Approved by: https://github.com/wdvr, https://github.com/malfet	2025-02-08 05:10:08 +00:00
Justin Chu	63c2909ae3	[ONNX] Adjust and add deprecation messages (#146639 ) Adjust and add deprecation messages to torch.onnx utilities and verification methods because they are only related to torch script and are obsolete. Removed unused `_exporter_states.py` and removed the internal deprecation module in favor of the typing_extensions deprecated decorator. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146639 Approved by: https://github.com/titaiwangms	2025-02-08 05:09:16 +00:00
Nikita Shulga	2328dcccb9	[MPSInductor] Implement Welford reduction (#146703 ) Still a work in progress: the fallback works as expected, but the custom shader does not. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146703 Approved by: https://github.com/jansel, https://github.com/dcci	2025-02-08 05:00:00 +00:00
drisspg	69feef5a94	Fix broken meta function for flex-attention backwards (#146563 ) # Summary Fixes https://github.com/pytorch/pytorch/issues/146377 So what was the original problem: we were codegening a really weird epilogue: ```Python # first compute broadcasted dk of shape [Bq, Hkv, KV_LEN, V_HEAD_DIM] # then reduce to dk of shape [Bkv, Hkv, KV_LEN, V_HEAD_DIM] xindex = index_k + 64*index_n + 64*off_hkv*ks2 + 128*off_zq*ks2 tl.store(out_ptr0 + (tl.broadcast_to(index_k + 64*index_n + off_hkv*ks1, dk.shape)), dk, mask) x5 = (xindex % ks3) tmp2 = tl.load(out_ptr0 + (x5 + ks1*off_hkv), mask, eviction_policy='evict_last') tl.store(out_ptr1 + (tl.broadcast_to(xindex, dk.shape)), tmp2, mask) ``` This epilogue was writing and then reading from overlapping regions of memory, causing a race condition. ### Why were we generating this epilogue During the lowering we created a buffer w/ a different size/stride from the expected return strides. I think this added an implicit node (for doing the permutation of this wrongly strided output to the expected one from the meta func). The scheduler for some reason thought it was okay to fuse this into the epilogue, tbh I don't know why. This fixes the broken meta func and the original repro. I will add a test, but the failure is hard to trigger; still better than nothing. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146563 Approved by: https://github.com/Chillee	2025-02-08 04:13:52 +00:00
David Peixotto	9c78fb920d	Fix assertion failure in gemm template lowering (#146353 ) Summary: This commit fixes a crash in the gemm template lowering caused by hitting an [assert](`fd515e4f59/torch/_inductor/codegen/common.py (L1181)`) that a buffer was previously removed. The assert triggers because in the first gemm lowering we use a local accumulation buffer, which causes the original buffer name to be added to the `removed_buffers` set. Then in the next gemm lowering we use the global buffer for accumulation, but that buffer name is already in the `removed_buffers` set. The fix is to add a unique suffix to the buffer name to avoid triggering the assert from different gemm lowerings. Differential Revision: D68814625 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146353 Approved by: https://github.com/leslie-fang-intel, https://github.com/frost-intel, https://github.com/hl475	2025-02-08 01:52:20 +00:00
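The failure mode and fix described in the row above can be sketched in a few lines. This is only an illustration under stated assumptions: `BufferTracker`, `mark_removed`, and `unique_name` are hypothetical names, and the real assert lives in Inductor's codegen, not in a class like this one.

```python
import itertools


class BufferTracker:
    """Hypothetical stand-in for the bookkeeping around `removed_buffers`."""

    def __init__(self) -> None:
        self.removed_buffers: set = set()
        self._suffix = itertools.count()

    def mark_removed(self, name: str) -> None:
        # Mirrors the kind of assert described in the commit message:
        # a buffer name must not be removed twice.
        assert name not in self.removed_buffers, f"{name} was already removed"
        self.removed_buffers.add(name)

    def unique_name(self, base: str) -> str:
        # The fix, as described: append a unique suffix per lowering.
        return f"{base}_{next(self._suffix)}"


tracker = BufferTracker()

# Two gemm lowerings that would otherwise reuse the same accumulation name.
first = tracker.unique_name("acc_buf")
tracker.mark_removed(first)
second = tracker.unique_name("acc_buf")
tracker.mark_removed(second)  # no assertion: the names differ
```

Reusing the bare name `"acc_buf"` for both lowerings would trip the assert on the second `mark_removed` call, which is the crash the commit describes.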
cyy	6cb2f737ee	Enable Windows tests (#146666 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146666 Approved by: https://github.com/albanD	2025-02-08 00:55:20 +00:00
Isalia20	0ab67299c3	[MPS] lu unpack (#146681 ) Implements the lu unpack function on MPS. No new tests were added: removing lu_unpack from UNIMPLEMENTED_XFAILLIST in test_mps means the existing `test_output_match` function now covers it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146681 Approved by: https://github.com/malfet Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>	2025-02-08 00:16:17 +00:00
Gregory Comer	803661526e	Update ET pin to 41e7ffa (#145831 ) ExecuTorch pin is failing to update due to a change in the executorch install scripts. The previous install_requirements.sh now only installs dependencies and does not build ET. There is a new script - install_executorch.sh, which both installs dependencies and builds the framework. This PR updates the relevant CI logic to use install_executorch.sh and bumps the pin forward. This should fix the stuck ET pin. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145831 Approved by: https://github.com/metascroy	2025-02-07 23:52:20 +00:00
Hyunho Yeo	dcac3c3e06	[MTIA] (2/n) Implement PyTorch APIs to query/reset device peak memory usage (#146659 ) Summary: Public summary (shared with Github): This diff implements the correct version of the PyTorch API "max_memory_allocated". Nit: The file previously contained two unit tests with the same name (due to wrong revert); I deleted a deprecated one to revamp the correct version. Test Plan: ``` buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- -r test_max_memory_allocated ``` https://www.internalfb.com/intern/testinfra/testrun/12103424065182810 Reviewed By: yuhc Differential Revision: D68988435 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146659 Approved by: https://github.com/nautsimon	2025-02-07 23:06:35 +00:00
Dingming Wu	fa34128435	revert PTD's change that leads to signature mismatch of printNcclCommProxyTrace (#146453 ) Summary: D68801098 introduced this function signature mismatch issue for printNcclCommProxyTrace. Revert it so that trunk build can pass. Test Plan: With the change, build of APS model using rcclexp can now pass: `sh scripts/ltian/run_jobs/fb_fm_v2/run_fb_fm_v2_job.sh -h T20_GTT_MI300X -n 16 -b 1024 -t [2024-12-06] -d ai_infra_ngs -e ai_infra_training_rnd_tc -x 0` Reviewed By: c-p-i-o Differential Revision: D69149588 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146453 Approved by: https://github.com/c-p-i-o	2025-02-07 22:43:52 +00:00
Avik Chaudhuri	103c8b44bc	move and fix logic to update unbacked bindings (#146115 ) Summary: Previously we were touching up unbacked bindings between Dynamo and AOTAutograd in strict export, but the logic had a bug: if an unbacked symint gets substituted by a backed symint, we would put the backed symint in the unbacked bindings (the check `is_symbol` was not enough here). This PR fixes this logic, and moreover, moves it into the serializer instead, because we don't need this adjustment outside serde. Test Plan: added test D68880766 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146115 Approved by: https://github.com/pianpwk	2025-02-07 22:41:19 +00:00
Lu Fang	45d35f5f5a	Clean up op BC check list (#146577 ) Summary: Remove the expired ones Test Plan: ci Differential Revision: D69226556 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146577 Approved by: https://github.com/hl475	2025-02-07 22:40:49 +00:00
Henry Hu	908133f682	[TreeSpec] Add custom comparision function (#146442 ) Summary: https://github.com/pytorch/pytorch/pull/145815 used caching for the treespec_loads calculation to speed up AOTI module calls. However, this made tests flaky when comparing TreeSpec for objects in local scope, e.g. 'test_export.TestExport.test_pytree_register_nested_data_class.<locals>.Inner'. Type comparison will yield False when local scopes are different due to lru_cache. Since this comparison is only used for testing purposes, we will only test whether the str(type) values are equal. Test Plan: ``` PYTORCH_TEST_WITH_ROCM=1 python test/export/test_retraceability.py ``` Differential Revision: D69137706 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146442 Approved by: https://github.com/angelayi	2025-02-07 22:39:21 +00:00
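The flakiness described in the row above comes from how Python treats classes defined in a local scope. A minimal sketch, with hypothetical names (`make_inner`, `same_type_by_name`) that are not part of PyTorch's pytree code:

```python
def make_inner():
    # Each call creates a brand-new class object, even though the
    # qualified name is identical every time.
    class Inner:
        pass

    return Inner


def same_type_by_name(a: type, b: type) -> bool:
    """Compare types by their string form, as the commit message describes."""
    return str(a) == str(b)


first = make_inner()
second = make_inner()

identical = first is second            # distinct class objects
equal = first == second                # default type equality is identity
by_name = same_type_by_name(first, second)
```

Here `identical` and `equal` are both `False` while `by_name` is `True`. If a cached TreeSpec holds a class from an earlier local scope, a strict type comparison against a freshly created class fails in the same way, so comparing `str(type)` is a looser check that suits test-only use.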
drisspg	91dfa82981	[FlexAttention] Fix dynamic shapes in max-autotune (#146657 ) # Fixes https://github.com/pytorch/pytorch/issues/146624 ### Updated From offline discussion, going w/ size_hint. However, this does incur guards. I couldn't really think of a fancy way to do this. I was going to do `V.graph.sizevars.size_hint` w/ some default for num blocks, but we ultimately need some information about the input. I am also not sure if size_hint is ALWAYS guaranteed to return the runtime value. I think it would be okay to not support unbacked symints (maybe). For instance, in the repro, we quickly hit the recompile limit. ```Shell torch._dynamo hit config.recompile_limit (8) function: 'flex_attention' (/home/drisspg/meta/pytorch/torch/nn/attention/flex_attention.py:1161) last reason: 0/0: tensor 'L['key']' size mismatch at index 2. expected 1, actual 546 To log all recompilation reasons, use TORCH_LOGS="recompiles". To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html. ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146657 Approved by: https://github.com/Chillee, https://github.com/yanboliang	2025-02-07 22:34:28 +00:00
Jason Ansel	579b9f2ed9	[inductor] Better exception error messages for cache_on_self (#146652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146652 Approved by: https://github.com/yanboliang	2025-02-07 21:22:21 +00:00
Jason Ansel	04ce02182b	[inductor] Use index_dtype (int32/int64 depending on size) for argmax accumulators (#146651 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146651 Approved by: https://github.com/shunting314, https://github.com/eellison	2025-02-07 21:21:21 +00:00
PyTorch MergeBot	80a1696679	Revert "[cuBLAS][cuBLASLt] Unify `cuBLASLt` workspaces with `cuBLAS` workspaces (#145130 )" This reverts commit `5f0901e573`. Reverted https://github.com/pytorch/pytorch/pull/145130 on behalf of https://github.com/atalman due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/145130#issuecomment-2644122846))	2025-02-07 21:04:23 +00:00
Henry Tsang	206ad9f4ad	[cutlass backend] Set no fallback to aten, disabled a few broken tests, default to test on H100 (#146554 ) This PR does a few things: * set fall back to aten to False for most tests. Without this, a lot of tests would fail silently since they just use aten * Disable two subprocess related broken tests. They would crash in subprocess. More investigation needed. * remove/disable the tests on A100. Let me elaborate a bit more. There are two types of A100 tests. * normal tests that also test A100. e.g., mm, addmm, bmm. However, since the shift to cutlass 3x, they don't work anymore. GenerateSM80 would generate ops that use cutlass 2x, but they get filtered out since they are of GemmKind.Universal but only GemmKind.Universal3x are supported in the 3x template. * tests for A100 only. The mixed mm and sparse semi structure tests have been failing for a while due to "TypeError: can't multiply sequence by non-int of type 'str'". Disabled them for now. Do let us know if you care about them @alexsamardzic Differential Revision: D69209929 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146554 Approved by: https://github.com/chenyang78	2025-02-07 19:59:28 +00:00
PyTorch MergeBot	f17109bd96	Revert "windows Magma build for cu128 (#146653 )" This reverts commit `9e27d36e2b`. Reverted https://github.com/pytorch/pytorch/pull/146653 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/146653#issuecomment-2643882976))	2025-02-07 19:37:16 +00:00
Shunting Zhang	bc0191802f	[inductor] add size-asserts for fallback ops (#145904 ) Fix https://github.com/pytorch/pytorch/issues/144717 Pull Request resolved: https://github.com/pytorch/pytorch/pull/145904 Approved by: https://github.com/jansel	2025-02-07 18:44:32 +00:00
Gabriel Ferns	b60f630de8	fuzzer: disable "fail_on_recompile_limit_hit" and "suppress_errors" (#146650 ) Summary: needed for https://github.com/pytorch/pytorch/pull/146513 Test Plan: the existing tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/146650 Approved by: https://github.com/xmfan	2025-02-07 18:25:00 +00:00
Ting Lu	9e27d36e2b	windows Magma build for cu128 (#146653 ) https://github.com/pytorch/pytorch/issues/145570 removing `.ci/pytorch/windows/internal/cuda_install.bat` as it is a duplicate of `.github/scripts/windows/cuda_install.bat`. The latter is the one in use - https://github.com/pytorch/pytorch/pull/146653/files#diff-613791f266f2f7b81148ca8f447b0cd6c6544f824f5f46a78a2794006c78957bR8 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146653 Approved by: https://github.com/atalman	2025-02-07 18:09:30 +00:00
Tristan Rice	23af9dde4d	distributed/serialization: add experimental streaming torch.save/load methods (#146555 ) Summary: This is intended for use with torchft when we need to do a streaming state dict transfer. This is strictly superior to the prior streaming method in torchft as this supports all tensor subclasses such as DTensor. This supports 100% of the inputs to torch.save/load but is not wire compatible nor intended to have any backwards compatibility. Security wise this fully supports weights_only and defaults to True. It does use pickle for some metadata but uses weights_only for the metadata. Adapted from: https://github.com/pytorch/torchft/pull/101 https://github.com/pytorch/torchft/pull/54 Test Plan: pytest test/distributed/test_serialization.py Pull Request resolved: https://github.com/pytorch/pytorch/pull/146555 Approved by: https://github.com/fegin, https://github.com/mikaylagawarecki Co-authored-by: Krishn Parasar <76171905+Krishn1412@users.noreply.github.com>	2025-02-07 18:08:11 +00:00
Tristan Rice	68631f6e87	PyWork: preserve Python reference counting when used in functional collectives (#146376 ) @fegin found an issue where torchft is not compatible with functional collectives. Found in https://github.com/pytorch/torchtitan/pull/806 The root cause is that PyProcessGroup/PyWork are not compatible with functional collectives due to a nasty ownership bug. PyWork relies on a pybind trampoline to propagate requests to Python. Unfortunately, the way Pybind works is that the Python object owns the C++ object rather than some form of shared ownership. Thus what happens is that the PyWork Python object will be collected when returned to C++ from the PyProcessGroup but the C++ PyWork object still exists. When the PyWork object is used, this causes a deadlock as the corresponding Python object no longer exists. To solve this, we introduce a new `PyWorkHolder` class which holds a reference to the `py::object` as well as the trampoline class. This resolves any dependency issues since we can now hold ownership in C++ to both the Python and C++ objects. To make this cleaner we introduce a `WORK_OVERRIDE` macro which is a patched version of `PYBIND11_OVERRIDE` that returns a `PyWorkHolder` rather than just `PyWork` and use it for all collectives in PyProcessGroup. Test plan: ``` cd pytorch pytest test/distributed/test_c10d_functional_native.py ``` ``` cd torchft pytest torchft/process_group_test.py -k functional -v -x -s ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146376 Approved by: https://github.com/yifuwang	2025-02-07 18:07:53 +00:00
James Wu	76c8a2dc48	Fix get_top() to return the base level event of the stack, not the most recently started event (#146649 ) `get_top()` is really confusing when talking about a stack, because it can mean the most recently started event on the stack or the toplevel event in perfetto (which displays the stack upside down). Rename to `get_outermost` and fix the bug associated with it, so that it returns the correct value out of the stack. Running nanogpt now puts `guard_latency_us` correctly in the `dynamo` event: ``` tlp python benchmarks/dynamo/torchbench.py --backend inductor --device cuda --only nanogpt --amp --cold-start-latency --print-compilation-time --training --performance 2>&1 --dynamic-shapes | tee out.log ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/146649 Approved by: https://github.com/masnesral, https://github.com/anijain2305	2025-02-07 18:04:50 +00:00
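The naming ambiguity described in the row above is easy to see on a plain list-backed stack. In this minimal sketch, `EventStack`, `push`, and `get_innermost` are hypothetical names for illustration; only `get_outermost` is taken from the commit message, and the event names are examples.

```python
from typing import List, Optional


class EventStack:
    """Illustrative stack of nested event names; not the real implementation."""

    def __init__(self) -> None:
        self._stack: List[str] = []

    def push(self, name: str) -> None:
        self._stack.append(name)

    def pop(self) -> str:
        return self._stack.pop()

    def get_outermost(self) -> Optional[str]:
        # The base-level event: the first one started and still open.
        return self._stack[0] if self._stack else None

    def get_innermost(self) -> Optional[str]:
        # The most recently started event, i.e. the usual "top" of a stack.
        return self._stack[-1] if self._stack else None


events = EventStack()
events.push("dynamo")
events.push("inductor_compile")
events.push("guard_build")

outermost = events.get_outermost()
innermost = events.get_innermost()
```

A metric that belongs to the whole compilation, such as a guard latency, should attach to `outermost` (`"dynamo"` here), not to whichever nested event happens to be open at the time.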
briancoutinho	f138b18d18	[inductor/profiler] add kernel kwargs instrumentation (#145573 ) ## About As above, record the kernel launch kwargs. These tend to be constexpr arguments to triton kernels, like block size etc. ## Test program Note, install triton before proceeding (pip install triton) triton_test.py>>> ``` import torch from torch.profiler import profile, ProfilerActivity def foo(x, y): a = torch.sin(x) b = torch.cos(y) return a + b def main(): x = torch.randn(10, 10).cuda() y = torch.randn(10, 10).cuda() opt_foo = torch.compile(foo) z = opt_foo(x, y) # Profile the kernel function on the GPU with profile( activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA], record_shapes=True ) as prof: z = opt_foo(x, y) # Export the trace to a file prof.export_chrome_trace("my_kernel_trace.json") if __name__ == "__main__": main() ``` Run it and we should get a trace file my_kernel_trace.json. The output has a triton event with the kernel_kwargs attribute. ``` { "ph": "X", "cat": "cpu_op", "name": "triton_poi_fused_add_cos_sin_0", "pid": 2480815, "tid": 2480815, "ts": 2045246693014.959, "dur": 75.662, "args": { ... "kernel_backend": "triton", "num_warps": 4, "kernel_kwargs": "XBLOCK=128", "num_stages": 1, "grid": "grid(100,)", "kernel_file": "/tmp/torchinductor_bcoutinho/ow/cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor.py", "kernel_hash": "cowpmkdpla4qfqj6jupnq4d7og7iz7eeb5wergubivubxd4xapor" } }, ``` ## Unit Test Updated unit test: ``` pytest test/inductor/test_profiler.py -k test_pt2_triton_attributes ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145573 Approved by: https://github.com/davidberard98, https://github.com/jansel	2025-02-07 17:44:30 +00:00
Animesh Jain	ee45ea599d	[dynamo] Actionable message on recompilations for fullgraph=True (#146550 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146550 Approved by: https://github.com/zou3519, https://github.com/StrongerXi ghstack dependencies: #146553	2025-02-07 17:28:43 +00:00
Animesh Jain	fa0956951c	[dynamo] Remove the suggestion to use suppress_errors on compiler error (#146553 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146553 Approved by: https://github.com/zou3519, https://github.com/jansel	2025-02-07 17:28:43 +00:00
cyy	25aa7ca62d	Cleanup CallOnce.h (#146700 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146700 Approved by: https://github.com/albanD	2025-02-07 16:44:45 +00:00
PyTorch MergeBot	076717785c	Revert "[while_loop][inductor] support sym expression as cond_fn output (#146222 )" This reverts commit `5ecdc428b2`. Reverted https://github.com/pytorch/pytorch/pull/146222 on behalf of https://github.com/atalman due to Internal failure, please see associated diff ([comment](https://github.com/pytorch/pytorch/pull/146222#issuecomment-2643379933))	2025-02-07 16:19:41 +00:00
eqy	5d7532140f	[CUDA][CUDA Graphs] Fix debug mode warning message (#145996 ) The real method is `enable_debug_mode()`, `_cuda_enable_graphs_debug_mode` does not exist. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145996 Approved by: https://github.com/ptrblck, https://github.com/eellison	2025-02-07 08:04:49 +00:00
eellison	002accfb8d	Check meta strides for expanded dims in effn_attn_bias (#146054 ) With the `_scaled_dot_product_efficient_attention.default`, we have lowering logic to realize the bias to specific alignment constraints. Some of the dims can be expanded, and we need to keep the stride of that dim to 0 to avoid materializing a larger tensor than we need. Previously, we had checked the stride of the tensor, but if it is not realized, that will not work, so we should check the strides of the meta as well. Note: getting the exact of realizing/slicing/requiring_exact_strides was a little tricky. I commented to @exclamaforte on an example unable-to-fuse message you get if you do it incorrectly. Fix for https://github.com/pytorch/pytorch/issues/145760 Pull Request resolved: https://github.com/pytorch/pytorch/pull/146054 Approved by: https://github.com/shunting314	2025-02-07 06:35:57 +00:00
eellison	71e8a2bda4	Expand inductor codegen dtype asserts, fix scan (#146067 ) We were codegening intermediary dtype asserts in some places but not all. This expands the assertions and fixes a newly failing assertion in `TORCHINDUCTOR_COMPILE_THREADS=1 TORCH_LOGS="output_code" PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 python test/inductor/test_torchinductor_opinfo.py TestInductorOpInfoCUDA.test_comprehensive_logcumsumexp_cuda_float16` for scan. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146067 Approved by: https://github.com/shunting314, https://github.com/jansel	2025-02-07 06:35:47 +00:00
cyy	f6bd20e8a2	Enable TemporaryFileName tests on Windows (#146311 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146311 Approved by: https://github.com/albanD	2025-02-07 06:06:18 +00:00
Pian Pawakapan	1c872803cb	[export][dynamic shapes] log provenance for locals & symbols for non-strict (#143378 ) Adds `dtrace_structured` logging so when a guard or real-tensor propagation assert is added, the relevant user code with local symbolic values & free symbols is logged, e.g. from the draft export CLI report (soon to be added to tlparse): 1. Guard added: ``` 1. Constraint violation error. The specified input dynamic_shapes spec was found to be incorrect during tracing. Specifically, this guard was added: Eq(s0, 3), where {'s0': "L['args'][0][0].size()[0]"}. This occured at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 267, in forward: assert a.shape[0] == 3 Locals: a: Tensor(shape: torch.Size([s0, 3]), stride: (3, 1), storage_offset: 0) Symbols: s0: L['args'][0][0].size()[0] ... ``` 2. Real tensor propagation: ``` 1. Data dependent error. When exporting, we were unable to evaluate the value of `u2 < 0`. This was encountered 8 times. This occurred at the following stacktrace: File /data/users/pianpwk/pytorch/test/export/test_draft_export.py, lineno 217, in forward: return res[:c_item] Locals: res: Tensor(shape: torch.Size([u0, u1]), stride: (Max(1, u1), 1), storage_offset: 0) c_item: u2 ... ``` Currently the values are extracted from the traceback, and are only valid for non-strict; strict seems to require storing & fakifying locals in the frames reported by `TracingContext`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143378 Approved by: https://github.com/avikchaudhuri, https://github.com/bobrenjc93	2025-02-07 05:46:05 +00:00
Aaron Gokaslan	bc40ccf6aa	[BE]: Inline special functions for MPS (#146627 ) These header functions should be inlined for consistency and to avoid translation unit / symbol issues. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146627 Approved by: https://github.com/dcci	2025-02-07 05:15:15 +00:00
Zhou32	ecf44d1002	Fixed a typo in dataset.py (#146600 ) Changed word 'Mult' to 'Multi'. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146600 Approved by: https://github.com/Skylion007	2025-02-07 05:09:51 +00:00
Justin Chu	41e6d189a3	[ONNX] Create deprecation warning on dynamo_export (#146425 ) Reland #146003 Deprecation of `torch.onnx.dynamo_export`: * [`torch/onnx/_internal/_exporter_legacy.py`]: Added deprecation warnings to the `OnnxRegistry`, `ExportOptions`, `ONNXRuntimeOptions`, and `dynamo_export` functions, indicating that `torch.onnx.dynamo_export` is deprecated since version 2.6.0 and should be replaced with `torch.onnx.export(..., dynamo=True)`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/146425 Approved by: https://github.com/titaiwangms, https://github.com/atalman	2025-02-07 04:20:46 +00:00
cyy	fa0592b568	Remove some NOLINT (#146610 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/146610 Approved by: https://github.com/Skylion007, https://github.com/malfet	2025-02-07 01:50:06 +00:00


84194 commits