Fix staging for CPU tensors in OSS DCP async_save (#145408)
Summary:
As found in
https://github.com/pytorch/pytorch/issues/144657
for CPU tensors we accidentally skip copying during staging because we use the offload-to-CPU helper, which is a no-op for tensors that are already on CPU. As a result, if the trainer changes the original source CPU tensor after launching the async save but before the actual writing/uploading to the destination commences, the writing/uploading logic picks up the latest state of the tensor instead of working from its own dedicated copy staged earlier. Dropping `_offload_state_dict_to_cpu` in favor of `_copy_state_dict` fixes this bug.
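To make the failure mode concrete, here is a minimal illustration with hypothetical helpers (not the actual `_offload_state_dict_to_cpu` / `_copy_state_dict` implementations):
```
import torch

def offload_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # No-op for tensors already on CPU: the "staged" tensor still aliases
    # the trainer's tensor.
    return t if t.device.type == "cpu" else t.to("cpu")

def copy_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # Always materialize a dedicated copy, even for CPU tensors.
    return t.detach().to("cpu", copy=True)

src = torch.ones(1)
staged_offload = offload_to_cpu(src)
staged_copy = copy_to_cpu(src)
src.fill_(42.0)                     # trainer keeps mutating after async_save returns
assert staged_offload[0] == 42.0    # staging observed the later mutation
assert staged_copy[0] == 1.0        # dedicated copy preserved the value at save time
```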
Test Plan:
Running the user script from the linked GitHub issue verifies the fix:
```
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        return x * self.weight
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
dist.init_process_group()
model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")
try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)
        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
    future.result()
    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
The script passes all asserts with this fix; I confirmed that on trunk it fails the first assert.
Differential Revision: D68518689
This PR implements @zou3519's idea of checking input mutations through the tensor version counter and checking aliasing via storage. Previously, we relied on whether there is an in-place op that takes a placeholder input, which doesn't take views into account.
While writing the PR, I also noticed a bug in the previous input-mutation checking logic: we were checking for mutating operators in `functionalized_f`, where all the mutating ops have already been replaced, so we would never detect anything.
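A rough sketch of the check with a hypothetical helper (not the PR's actual implementation): an input counts as mutated if its version counter advanced across the call, and an output aliases an input if they share a storage.
```
import torch

def check_mutation_and_aliasing(fn, *inputs):
    # Tensor._version advances whenever a tensor is mutated in place,
    # including when it is mutated through a view.
    versions_before = [t._version for t in inputs]
    input_storages = {t.untyped_storage().data_ptr() for t in inputs}

    out = fn(*inputs)

    mutated = [t._version != v for t, v in zip(inputs, versions_before)]
    outputs = out if isinstance(out, (tuple, list)) else (out,)
    aliases_input = [
        o.untyped_storage().data_ptr() in input_storages for o in outputs
    ]
    return mutated, aliases_input

# A view mutation is caught even though the in-place op never takes the
# placeholder input directly:
def f(x):
    y = x.view(-1)
    y.add_(1)      # mutates x through a view
    return y       # aliases x

print(check_mutation_and_aliasing(f, torch.zeros(2, 2)))  # ([True], [True])
```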
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298
Approved by: https://github.com/zou3519
PR #141153 exposed the option to collect sizes as dynamic. After that
change, the function set_autograd_compiler returns a PyTuple object which
is populated using the PyTuple_SET_ITEM function. That function steals the
reference to the object and doesn't INCREF it. So we are currently
missing an INCREF on prior_compiler when it is Py_None and an INCREF on
prior_dynamic, which is either Py_False or Py_True. This bug may lead to
memory corruption.
@xmfan @jansel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
# Summary
- Adds support for non-power-of-2 head_dim by launching blocks with head_dim rounded up to the next valid power of two (see the sketch after this list).
- The other option I considered was building up the final dot products from smaller blocks; this would probably work, but for the sake of code complexity I am going with the rounding option for now.
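A minimal sketch of the rounding (hypothetical helper; the kernel-side masking of the extra lanes is omitted):
```
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n; used to size kernel blocks for head_dim.
    return 1 << (n - 1).bit_length()

# A head_dim of 96 gets blocks sized for 128; the extra 32 lanes are masked
# out on load/store, so results are unaffected.
assert next_power_of_2(96) == 128
assert next_power_of_2(64) == 64
```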
### Corollary
We had a bug in our backwards kernel where we were using `index_k` instead of `index_v`. This should have shown up in the qk_head_dim != v_head_dim cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495
Approved by: https://github.com/Chillee
Summary:
Importing `Iterable` from `collections.abc` here causes an internal product to fail MRO resolution, due to a collision between `Iterable` and `Generic`.
This fixes the failure on D68461304
Differential Revision: D68531443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438
Approved by: https://github.com/izaitsevfb
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` needs to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:
6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)
Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB of peak memory usage.
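A minimal sketch of the effect (assumed shapes, requires a GPU; the linked gist is the authoritative repro):
```
import torch

if torch.cuda.is_available():
    src = torch.randn(8192, 8192, device="cuda").t()  # non-contiguous view, ~256MiB backing
    dst = torch.empty(8192, 8192, device="cpu")       # contiguous CPU destination

    torch.cuda.reset_peak_memory_stats()
    allocated_before = torch.cuda.memory_allocated()
    dst.copy_(src)  # Copy.cu stages a contiguous GPU temporary for the non-contiguous src
    extra = torch.cuda.max_memory_allocated() - allocated_before
    print(f"extra GPU memory during copy_: {extra / 2**20:.0f} MiB")  # ~256 MiB
```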
Note that, with all of the following efforts:
1. donated buffer
2. inplace padding
3. this PR
we save 3GB of peak memory (18.6GB -> 15.5GB) for the GPT2 model with torch.compile.
The peak memory profile of GPT2 looks like an '...\_M\_...' shape: there are two places where we reach the peak. The donated buffer removes the first peak by computing grad_softmax in place, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding.
Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)
With 1 & 2 & 3, we save 3GB of peak memory (18.6GB -> 15.5GB) regardless of whether autotuning happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
Other accelerators can also extend this functionality by adding their device to the devices list (e.g. xpu).
**CHANGES**
- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (see the sketch after this list)
- Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
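A hedged sketch of that pattern (hypothetical test class, not the exact classes touched by this PR):
```
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestOpsOnAccelerators(TestCase):  # hypothetical class name
    def test_add(self, device):
        x = torch.ones(2, 2, device=device)
        self.assertEqual((x + x).sum().item(), 8.0)

# Generates TestOpsOnAcceleratorsCUDA / TestOpsOnAcceleratorsHPU variants;
# device-specific classes are created only for backends available in the environment.
instantiate_device_type_tests(TestOpsOnAccelerators, globals(), only_for=("cuda", "hpu"))

if __name__ == "__main__":
    run_tests()
```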
We had previously submitted these changes in https://github.com/pytorch/pytorch/pull/140131; however, that PR was deleted due to merge conflicts and other issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387
Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey
Some context: inplace padding is an optimization to do padding in place. E.g., suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element onto the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place, which saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and can cause an out-of-bounds memory access (although I have never seen this happen, perhaps due to over-allocation in the CUDACachingAllocator, it should still be fixed).
The fix is that when we allocate the tensor, instead of doing something like:
```
buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small over-allocation:
```
buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
cpp_wrapper needs special handling since memory allocation goes through a different code path than the Python wrapper.
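For illustration only (not Inductor's actual generated code), the over-allocation is what makes the in-place pad safe, because the extra column is backed by real storage:
```
import torch

# Over-allocate a [2048, 2048] backing buffer, then view it as [2048, 2047].
base = torch.randn(2048, 2048)
buf0 = base.as_strided((2048, 2047), (2048, 1))

# Padding each row by one element can reuse the same storage: the "padded"
# tensor is just a wider view over the same buffer, so no copy is needed and
# the pad column writes into memory that is genuinely allocated.
padded = base.as_strided((2048, 2048), (2048, 1))
padded[:, -1] = 0.0
assert padded[:, :2047].data_ptr() == buf0.data_ptr()
```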
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
Summary:
Explicitly catch the c10 error and log the error message only.
The standard exception's `e.what()` below ends up logging the stack trace, which confuses users.
See S477887 for details.
Test Plan:
Tested locally:
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB Down: 4.7GiB (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets. Remaining 0/3024 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining 0/31483 1481904 actions, 1719048 artifacts declared
Executing actions. Remaining 0/250391 77:11:29.7s exec time total
Command: test. Finished 2031 local, 45445 remote, 51473 cache (52% hit) 20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D68516080
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- This is required to get an idea of how PRs are landing on our mi200s and mi300s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>