Previously, we didn't expand the shape of the `example_value` of `map` to match the inputs (specifically, the first mapped dimension). This PR fixes that bug. To make this easier, we change `_call_function_and_unflatten_output` to accept example values directly instead of retrieving them from the variable trackers.
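For intuition, here is a minimal sketch of the shape contract being fixed (plain reference code, not the internal Dynamo logic; `reference_map` is a hypothetical stand-in for the `map` higher-order op):
```
import torch

def reference_map(f, xs):
    # map(f, xs) applies f to each slice xs[i] along dim 0, so the output's
    # leading dimension must equal xs.shape[0] -- the first mapped dimension.
    return torch.stack([f(x) for x in xs.unbind(0)])

xs = torch.randn(4, 3)
out = reference_map(torch.sin, xs)
assert out.shape[0] == xs.shape[0]  # the example_value's shape must reflect this
```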
Also removes a redundant call_function node in the strict_mode higher-order op in Dynamo.
Test Plan:
existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124203
Approved by: https://github.com/ezyang, https://github.com/zou3519
#121313 changed precompiled patterns so they are more integrated with the pattern-matching code. This resulted in a list of "known" patterns (with their example data) being stored globally. Unfortunately, since small FakeTensors store a constant copy of the original tensor, this meant we leaked CUDA tensors through the example data.
Fix this by clearing out the constant storage for the example data that we keep around.
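A minimal sketch of the idea, assuming the `constant` attribute of `torch._subclasses.fake_tensor.FakeTensor` (`strip_constants` is a hypothetical helper, not the actual patch):
```
def strip_constants(example_inputs):
    # A small FakeTensor may keep the real source tensor alive via .constant;
    # dropping it stops the globally stored example data from pinning CUDA memory.
    for t in example_inputs:
        if getattr(t, "constant", None) is not None:
            t.constant = None
```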
Fixes #124081
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124345
Approved by: https://github.com/xuzhao9
The MTIA device now has its own module in PyTorch.
`torch.mtia` has the following APIs, similar to other backends. Lazy initialization (`lazy_init`) is also supported.
```
__all__ = [
    "init",
    "is_available",
    "synchronize",
    "device_count",
    "current_device",
    "current_stream",
    "default_stream",
    "set_stream",
    "stream",
    "device",
]
```
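A hedged usage sketch, assuming a PyTorch build with MTIA support (using `device` as a context manager, following other backends):
```
import torch

if torch.mtia.is_available():
    torch.mtia.init()  # optional in practice, since lazy_init is supported
    print(torch.mtia.device_count(), torch.mtia.current_device())
    with torch.mtia.device(0):
        s = torch.mtia.current_stream()
        torch.mtia.set_stream(s)
        torch.mtia.synchronize()
```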
------------
For device management, we expand `AcceleratorHooksInterface` to support generic device management; it can be used from both C++ and Python.
```
def _accelerator_hooks_device_count() -> _int: ...
def _accelerator_hooks_set_current_device(device_index: _int) -> None: ...
def _accelerator_hooks_get_current_device() -> _int: ...
def _accelerator_hooks_exchange_device(device_index: _int) -> _int: ...
def _accelerator_hooks_maybe_exchange_device(device_index: _int) -> _int: ...
```
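These private bindings live on `torch._C`; a short sketch of how they compose (assuming an accelerator backend is present):
```
import torch

n = torch._C._accelerator_hooks_device_count()
if n > 0:
    prev = torch._C._accelerator_hooks_exchange_device(0)  # returns the prior index
    assert torch._C._accelerator_hooks_get_current_device() == 0
    torch._C._accelerator_hooks_maybe_exchange_device(prev)  # restore the old device
```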
---------
Adds a `get_device_module` API to retrieve the device module for a given device type.
```
def get_device_module(device: Optional[Union[torch.device, str]] = None)
```
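For example (a sketch; both the string and `torch.device` forms follow the signature above):
```
import torch

cuda_mod = torch.get_device_module("cuda")                # expected to be torch.cuda
same_mod = torch.get_device_module(torch.device("cuda"))  # also torch.cuda
default_mod = torch.get_device_module()                   # module for the current accelerator
```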
---------
@exported-using-ghexport
Differential Revision: [D52923602](https://our.internmc.facebook.com/intern/diff/D52923602/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123612
Approved by: https://github.com/albanD
ghstack dependencies: #123611
This diff builds device-generic `torch.Stream` and `torch.Event` for newly added accelerators in PyTorch.
------------
**torch.Stream APIs**
```
# Defined in torch/csrc/Stream.cpp
class Stream(_StreamBase):
    stream_id: _int  # Stream id
    device_index: _int
    device_type: _int
    device: _device  # The device of the stream

    @overload
    def __new__(self, device: Optional[DeviceLikeType] = None, priority: _int = 0) -> Stream: ...
    @overload
    def __new__(self, stream_id: _int, device_index: _int, device_type: _int, priority: _int = 0) -> Stream: ...
    def query(self) -> _bool: ...
    def synchronize(self) -> None: ...
    def wait_event(self, event: Event) -> None: ...
    def wait_stream(self, other: Stream) -> None: ...
    def record_event(self, event: Optional[Event] = None) -> Event: ...
    def __hash__(self) -> _int: ...
    def __repr__(self) -> str: ...
    def __eq__(self, other: object) -> _bool: ...
```
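A hedged usage sketch of the generic `Stream` (assuming a stream-capable backend such as CUDA):
```
import torch

s = torch.Stream(device="cuda")  # device-generic stream object
e = s.record_event()             # record an Event on this stream
s.wait_event(e)                  # make the stream wait for the event
s.synchronize()                  # block the host until submitted work completes
assert s.query()                 # all work on the stream has finished
```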
------------------
**torch.Event APIs**:
- IPC-related APIs are not implemented, since many device backends don't support them, but we leave the interfaces in place for future adaptation of `torch.cuda.Event`.
- Currently only `enable_timing` is supported, since it is the most commonly used flag on other device backends. We would have to refactor the event flag system in PyTorch to support fancier flags.
- An `elapsedTime` API is added to `c10::Event`.
```
# Defined in torch/csrc/Event.cpp
class Event(_EventBase):
    device: _device  # The device of the Event
    event_id: _int  # The raw event created by the device backend

    def __new__(
        self,
        device: Optional[DeviceLikeType] = None,
        enable_timing: _bool = False,
        blocking: _bool = False,
        interprocess: _bool = False,
    ) -> Event: ...
    @classmethod
    def from_ipc_handle(cls, device: DeviceLikeType, ipc_handle: bytes) -> Event: ...
    def record(self, stream: Optional[Stream] = None) -> None: ...
    def wait(self, stream: Optional[Stream] = None) -> None: ...
    def query(self) -> _bool: ...
    def elapsed_time(self, other: Event) -> _float: ...
    def synchronize(self) -> None: ...
    def ipc_handle(self) -> bytes: ...
    def __repr__(self) -> str: ...
```
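A hedged timing sketch using the generic `Event` (assuming a backend that implements `enable_timing`):
```
import torch

start = torch.Event(device="cuda", enable_timing=True)
end = torch.Event(device="cuda", enable_timing=True)
start.record()                  # records on the current stream
torch.randn(1024, 1024, device="cuda").sum()
end.record()
end.synchronize()
print(start.elapsed_time(end))  # elapsed milliseconds between the two events
```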
-----------
`c10::Event` provides new APIs to:
- calculate **elapsedTime**,
- get the raw event id,
- synchronize the event.
```
  double elapsedTime(const Event& event) const {
    return impl_.elapsedTime(event.impl_);
  }

  void* eventId() const {
    return impl_.eventId();
  }

  void synchronize() const {
    return impl_.synchronize();
  }
```
----------
TODO: find a good way to test these in PyTorch with API mocks.
Differential Revision: [D55351839](https://our.internmc.facebook.com/intern/diff/D55351839/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123611
Approved by: https://github.com/albanD
Differential Revision: D56200666
Previously, when we hit the Functionalize kernel for lift_fresh_copy, we directly dispatched `self.clone()` to proxy dispatch. As a result, we ended up receiving a functional tensor at proxy dispatch. As a workaround, we unwrap `self` manually. Not sure why it works fine in aot-dispatch, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124198
Approved by: https://github.com/bdhirsh
Fix by creating constants using the input tensor's dtype.
One-line reproducer:
```
python -c "import torch; x=torch.arange(3, dtype=torch.float16,device='mps');print(torch.nn.functional.binary_cross_entropy(x, x))"
```
Before the change
```
loc("mps_subtract"("(mpsFileLoc): /AppleInternal/Library/BuildRoots/ce725a5f-c761-11ee-a4ec-b6ef2fd8d87b/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShadersGraph/mpsgraph/MetalPerformanceShadersGraph/Core/Files/MPSGraphUtilities.mm":233:0)): error: input types 'tensor<f32>' and 'tensor<3xf16>' are not broadcast compatible
LLVM ERROR: Failed to infer result type(s).
```
After
```
tensor(-33.7812, device='mps:0', dtype=torch.float16)
```
Fixes https://github.com/pytorch/pytorch/issues/124252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124258
Approved by: https://github.com/kulinseth
The `recurse` argument was not being respected for `set_requires_gradient_sync`. This PR fixes that.
The previous unit test did not have nested FSDP modules with managed parameters, so `recurse=False` was not being exercised. We augment the unit test to disable gradient sync only for the root module and not its children.
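With the fix, the following only affects the root module (a sketch, assuming a model wrapped with `fully_shard`):
```
# Disable gradient sync on the root FSDP module only; nested FSDP children
# keep syncing their gradients because recurse=False is now respected.
model.set_requires_gradient_sync(False, recurse=False)
```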
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124318
Approved by: https://github.com/weifengpy
ghstack dependencies: #120952, #124293
A kernel uses "dispatcher convention" if there is an additional keyset argument at the beginning of its argument list. This PR:
- adds a way to register kernels in dispatcher convention using `Library.impl` (pass `dispatcher_convention=True`)
- adds `OpOverload.redispatch`

We use both of the above in the new custom ops API: we register the autograd kernel in dispatcher convention so that we can call redispatch the way PyTorch built-in ops do. A sketch follows.
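A hedged sketch of the registration path (`mylib::foo` is a hypothetical op, and the flag name follows this PR's description; `_after_autograd_keyset` is the keyset PyTorch uses to continue dispatch below Autograd):
```
import torch

lib = torch.library.Library("mylib", "DEF")
lib.define("foo(Tensor x) -> Tensor")

def autograd_kernel(keyset, x):
    # Dispatcher convention: the DispatchKeySet arrives as the first argument.
    # Redispatch below Autograd, like built-in ops do, instead of re-entering
    # the dispatcher from the top.
    return torch.ops.mylib.foo.default.redispatch(
        keyset & torch._C._after_autograd_keyset, x
    )

lib.impl("foo", autograd_kernel, "Autograd", dispatcher_convention=True)
```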
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124089
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065, #124066, #124071
We allow it to accept:
- a string with the op name
- an OpOverload
- a new-style custom op

If any of these refer to a new-style custom op (created with the `custom_op` decorator), then we dispatch to `CustomOpDef.register_fake`. Otherwise, we do what we previously did.
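A sketch of the three accepted forms, assuming the API in question is `torch.library.register_fake` and `mylib::foo` is a hypothetical op:
```
import torch

# A new-style custom op, created with the custom_op decorator.
@torch.library.custom_op("mylib::foo", mutates_args=())
def foo(x: torch.Tensor) -> torch.Tensor:
    return x.sin()

# Passing the custom op dispatches to CustomOpDef.register_fake.
@torch.library.register_fake(foo)
def _(x):
    return torch.empty_like(x)

# The string and OpOverload forms behave as before:
#   torch.library.register_fake("mylib::foo")(fake_fn)
#   torch.library.register_fake(torch.ops.mylib.foo.default)(fake_fn)
```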
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124066
Approved by: https://github.com/albanD
ghstack dependencies: #123937, #124064, #124065
Summary:
We explicitly set the cuBLAS workspace even though CUDA 12.2+ fixed the issue where memory usage increased during graph capture. Original issue: https://github.com/pytorch/pytorch/pull/83461
This is because in CUDA 12.2+, cuBLAS's use of cudaMallocAsync allocates memory dynamically (even if the allocations are cheap) outside PyTorch's CUDA caching allocator (CCA). It's possible that the CCA has used up all the memory, in which case cuBLAS's cudaMallocAsync will return OOM.
Test Plan: CI
Differential Revision: D56226746
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124250
Approved by: https://github.com/houseroad, https://github.com/eqy
Fixes https://github.com/pytorch/pytorch/issues/119607 for 3.11+.
In 3.11+, `_PyFrame_FastToLocalsWithError` could implicitly run `COPY_FREE_VARS` on the original frame, leading to double increfs since the Dynamo shadow frame can rerun `COPY_FREE_VARS`. So the solution is to skip the first `COPY_FREE_VARS` instruction in the shadow frame if it was already executed in the original frame.
Also moves the location for clearing the original frame in 3.12 to handle error cases more thoroughly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124238
Approved by: https://github.com/jansel
Measuring peak memory on the first run can capture cases where compiled artifacts leak into runtime, but it also introduces a lot of noise from cuDNN/Triton autotuning, which generally uses as much memory as it can. Making this flag a default will need some discussion, so I am only adding it to unblock compiled-backward benchmarking (where all autotuning memory use is exposed).
```
e.g. resnet50
# without --warm-peak-memory
memory: eager: 1.95 GB, dynamo: 6.68 GB, ratio: 0.29
# with --warm-peak-memory
memory: eager: 1.96 GB, dynamo: 2.06 GB, ratio: 0.95
```

This issue may also affect large models. Here's an example case of `cudnn_convolution_backward` autotuning allocating 30 GB to tune a model that otherwise uses 5 GB of memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124326
Approved by: https://github.com/jansel
ghstack dependencies: #119411