Fix staging for CPU tensors in OSS DCP async_save (#145408)
Summary:
As found in
https://github.com/pytorch/pytorch/issues/144657
for CPU tensors we accidentally skip copying during staging because we use the offload-to-CPU helper, which is a no-op for tensors that are already on CPU. As a result, if the trainer changes the original source CPU tensor after launching the async save but before the actual writing/uploading to the destination commences, the writing/uploading logic picks up the latest state of the tensor instead of working from its own dedicated copy staged earlier. Dropping `_offload_state_dict_to_cpu` in favor of `_copy_state_dict` fixes this bug.
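To make the failure mode concrete, here is a minimal illustration with hypothetical helpers (not the actual `_offload_state_dict_to_cpu` / `_copy_state_dict` implementations):
```
import torch

def offload_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # No-op for tensors already on CPU: the "staged" tensor still aliases
    # the trainer's tensor.
    return t if t.device.type == "cpu" else t.to("cpu")

def copy_to_cpu(t: torch.Tensor) -> torch.Tensor:
    # Always materialize a dedicated copy, even for CPU tensors.
    return t.detach().to("cpu", copy=True)

src = torch.ones(1)
staged_offload = offload_to_cpu(src)
staged_copy = copy_to_cpu(src)
src.fill_(42.0)                     # trainer keeps mutating after async_save returns
assert staged_offload[0] == 42.0    # staging observed the later mutation
assert staged_copy[0] == 1.0        # dedicated copy preserved the value at save time
```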
Test Plan:
Running the user script from the linked GitHub issue verifies the fix:
```
import os
import torch
import torch.distributed as dist
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_model_state_dict
import torch.nn as nn
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(1, 1))

    def forward(self, x):
        return x * self.weight
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "12345"
os.environ["WORLD_SIZE"] = "1"
os.environ["RANK"] = "0"
dist.init_process_group()
model = Net()
state_dict = get_model_state_dict(model)
pg = dist.new_group(backend="gloo")
try:
    steps = [10, 20, 30, 40, 50]
    future = None
    for step in steps:
        # simulate a training step, e.g. optimizer updating values
        with torch.no_grad():
            model.weight.data.fill_(step)
        if future is not None:
            future.result()
            future = None
        future = dcp.async_save(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
    future.result()
    for step in steps:
        dcp.load(
            state_dict,
            checkpoint_id=f"outputs/{step}",
            process_group=pg,
        )
        assert state_dict["weight"][0, 0] == step, f"got {state_dict['weight'][0, 0]=} on {step=}"
finally:
    dist.destroy_process_group(pg)
    dist.destroy_process_group()
```
The script passes all asserts with this fix; I confirmed that on trunk it fails the first assert.
Differential Revision: D68518689
This PR implements @zou3519's idea of checking input mutations through the tensor version counter and checking aliasing via storage. Previously, we relied on whether there is an in-place op that takes a placeholder input, which doesn't take views into account.
While writing the PR, I also noticed a bug in the previous input-mutation checking logic: we were checking for mutating operators in `functionalized_f`, where all the mutating ops have already been replaced, so we would never detect anything.
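A rough sketch of the check with a hypothetical helper (not the PR's actual implementation): an input counts as mutated if its version counter advanced across the call, and an output aliases an input if they share a storage.
```
import torch

def check_mutation_and_aliasing(fn, *inputs):
    # Tensor._version advances whenever a tensor is mutated in place,
    # including when it is mutated through a view.
    versions_before = [t._version for t in inputs]
    input_storages = {t.untyped_storage().data_ptr() for t in inputs}

    out = fn(*inputs)

    mutated = [t._version != v for t, v in zip(inputs, versions_before)]
    outputs = out if isinstance(out, (tuple, list)) else (out,)
    aliases_input = [
        o.untyped_storage().data_ptr() in input_storages for o in outputs
    ]
    return mutated, aliases_input

# A view mutation is caught even though the in-place op never takes the
# placeholder input directly:
def f(x):
    y = x.view(-1)
    y.add_(1)      # mutates x through a view
    return y       # aliases x

print(check_mutation_and_aliasing(f, torch.zeros(2, 2)))  # ([True], [True])
```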
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145298
Approved by: https://github.com/zou3519
PR #141153 exposed the option to collect sizes as dynamic. After that
change, the function set_autograd_compiler returns a PyTuple object which
is populated using the PyTuple_SET_ITEM function. That function steals the
reference to the object and doesn't INCREF it. So we are currently
missing an INCREF on prior_compiler when it is Py_None and an INCREF on
prior_dynamic, which is either Py_False or Py_True. This bug may lead to
memory corruption.
@xmfan @jansel @albanD
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145482
Approved by: https://github.com/albanD, https://github.com/jansel
# Summary
- Adds support for non-power-of-2 head_dim by launching blocks with head_dim rounded up to the next valid power of two (see the sketch after this list).
- The other option I considered was building up the final dot products from smaller blocks; this would probably work, but for the sake of code complexity I am going with the rounding option for now.
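A minimal sketch of the rounding (hypothetical helper; the kernel-side masking of the extra lanes is omitted):
```
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n; used to size kernel blocks for head_dim.
    return 1 << (n - 1).bit_length()

# A head_dim of 96 gets blocks sized for 128; the extra 32 lanes are masked
# out on load/store, so results are unaffected.
assert next_power_of_2(96) == 128
assert next_power_of_2(64) == 64
```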
### Corollary
We had a bug in our backwards kernel where we were using `index_k` instead of `index_v`. This should have shown up in the qk_head_dim != v_head_dim cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133495
Approved by: https://github.com/Chillee
Summary:
Importing `Iterable` from `collections.abc` here causes an internal product to fail MRO resolution, due to a collision between `Iterable` and `Generic`.
This fixes the failure on D68461304
Differential Revision: D68531443
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145438
Approved by: https://github.com/izaitsevfb
We use `cpu_tensor.copy_(gpu_tensor)` to clone mutated kernel arguments for autotuning. The purpose is to avoid increasing peak memory due to the clone. But if `gpu_tensor` is not contiguous, this `copy_` needs to allocate a temporary tensor on the GPU to store a contiguous copy of `gpu_tensor`:
6e53588789/aten/src/ATen/native/cuda/Copy.cu (L322-L334)
Here is a standalone script to illustrate this behavior: https://gist.github.com/shunting314/812a848dc67b1d674ae42415a7a462c8 . The script reports 6GB rather than 3GB of peak memory usage.
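A minimal sketch of the effect (assumed shapes, requires a GPU; the linked gist is the authoritative repro):
```
import torch

if torch.cuda.is_available():
    src = torch.randn(8192, 8192, device="cuda").t()  # non-contiguous view, ~256MiB backing
    dst = torch.empty(8192, 8192, device="cpu")       # contiguous CPU destination

    torch.cuda.reset_peak_memory_stats()
    allocated_before = torch.cuda.memory_allocated()
    dst.copy_(src)  # Copy.cu stages a contiguous GPU temporary for the non-contiguous src
    extra = torch.cuda.max_memory_allocated() - allocated_before
    print(f"extra GPU memory during copy_: {extra / 2**20:.0f} MiB")  # ~256 MiB
```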
Note that, with all of the following efforts:
1. donated buffer
2. inplace padding
3. this PR
we save 3GB of peak memory (18.6GB -> 15.5GB) for the GPT2 model with torch.compile.
The peak memory profile of GPT2 looks like an '...\_M\_...' shape: there are two places where we reach the peak. The donated buffer removes the first peak by computing grad_softmax in place, and inplace padding removes the second peak by not allocating an extra buffer for mm-padding.
Before all these optimizations, the peak memory is 18.6GB for GPT2 with torch.compile.
With 1 & 2, the peak memory is
1. 17.7GB with a cold cache
2. 15.5GB with a warm cache (since the autotuning overhead is skipped)
With 1 & 2 & 3, we save 3GB of peak memory (18.6GB -> 15.5GB) regardless of whether autotuning happens.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145410
Approved by: https://github.com/masnesral, https://github.com/jansel
ghstack dependencies: #140249, #145325
**MOTIVATION**
We recently integrated support for Intel Gaudi devices (identified as 'hpu') into the common_device_type framework via the pull request at https://github.com/pytorch/pytorch/pull/126970. This integration allows tests to be automatically instantiated for Gaudi devices upon loading the relevant library. Building on this development, the current pull request extends the utility of these hooks by adapting selected CUDA tests to operate on Gaudi devices. Additionally, we have confirmed that these modifications do not interfere with the existing tests on CUDA devices.
Other accelerators can also extend this functionality by adding their device to the devices list (e.g. xpu).
**CHANGES**
- Create a separate class for test functions running on CUDA devices
- Extend the functionality of these tests to include HPUs
- Use instantiate_device_type_tests with targeted attributes to generate device-specific test instances within the new classes (see the sketch after this list)
- Apply the skipIfHPU decorator to bypass tests that are not yet compatible with HPU devices
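A hedged sketch of that pattern (hypothetical test class, not the exact classes touched by this PR):
```
import torch
from torch.testing._internal.common_device_type import instantiate_device_type_tests
from torch.testing._internal.common_utils import TestCase, run_tests

class TestOpsOnAccelerators(TestCase):  # hypothetical class name
    def test_add(self, device):
        x = torch.ones(2, 2, device=device)
        self.assertEqual((x + x).sum().item(), 8.0)

# Generates TestOpsOnAcceleratorsCUDA / TestOpsOnAcceleratorsHPU variants;
# device-specific classes are created only for backends available in the environment.
instantiate_device_type_tests(TestOpsOnAccelerators, globals(), only_for=("cuda", "hpu"))

if __name__ == "__main__":
    run_tests()
```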
We had previously submitted these changes in https://github.com/pytorch/pytorch/pull/140131; however, that PR was deleted due to merge conflicts and other issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144387
Approved by: https://github.com/ankurneog, https://github.com/EikanWang, https://github.com/yanboliang, https://github.com/guangyey
Some context: inplace padding is an optimization to do padding in place. E.g., suppose a tensor has size [2048, 2047] and stride [2048, 1]. When we need to pad one extra element onto the end of each row (e.g. during mm padding), we can just reuse the original tensor and do the padding in place, which saves memory and bandwidth. One caveat for this optimization is that PyTorch does not allocate 2048 elements for the last row of the original tensor; it only allocates 2047 elements. So assuming the last row has enough space for 2048 elements may be wrong and can cause an out-of-bounds memory access (although I have never seen this happen, perhaps due to over-allocation in the CUDACachingAllocator, it should still be fixed).
The fix is that when we allocate the tensor, instead of doing something like:
```
buf0 = randn_strided([2048, 2047], [2048, 1])
```
we do a small over-allocation:
```
buf0 = randn_strided([2048, 2048], [2048, 1]).as_strided([2048, 2047], [2048, 1])
```
cpp_wrapper needs special handling since memory allocation goes through a different code path than the Python wrapper.
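For illustration only (not Inductor's actual generated code), the over-allocation is what makes the in-place pad safe, because the extra column is backed by real storage:
```
import torch

# Over-allocate a [2048, 2048] backing buffer, then view it as [2048, 2047].
base = torch.randn(2048, 2048)
buf0 = base.as_strided((2048, 2047), (2048, 1))

# Padding each row by one element can reuse the same storage: the "padded"
# tensor is just a wider view over the same buffer, so no copy is needed and
# the pad column writes into memory that is genuinely allocated.
padded = base.as_strided((2048, 2048), (2048, 1))
padded[:, -1] = 0.0
assert padded[:, :2047].data_ptr() == buf0.data_ptr()
```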
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145325
Approved by: https://github.com/desertfire, https://github.com/jansel
ghstack dependencies: #140249
Summary:
Explicitly catch the c10 error and log the error message only.
The standard exception's `e.what()` below ends up logging the stack trace, which confuses users.
See S477887 for details.
Test Plan:
Tested locally:
```
buck test caffe2/test/cpp/c10d:TCPStoreTest
buck2 daemon constraint mismatch: Version mismatch; killing daemon...
Starting new buck2 daemon...
Connected to new buck2 daemon.
File changed: fbcode//caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
File changed: fbsource//xplat/caffe2/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
Soft Error: source_directory_includes_subpackage: Directory `v2.17.1-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.17.1-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.18.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.18.3-1/src/tests`.
Soft Error: source_directory_includes_subpackage: Directory `v2.19.3-1` of package `fbsource//third-party/nccl` may not cover any subpackages, but includes subpackage `v2.19.3-1/src/tests`.
Buck UI: https://www.internalfb.com/buck2/dbd34fa4-50ed-4eeb-800d-688f5a7bec68
Test UI: https://www.internalfb.com/intern/testinfra/testrun/281475375994918
Network: Up: 1.5GiB Down: 4.7GiB (reSessionID-d6b0568e-2347-4375-a2d9-2d03ca0c2161)
Loading targets. Remaining 0/3024 69199 dirs read, 687558 targets declared
Analyzing targets. Remaining 0/31483 1481904 actions, 1719048 artifacts declared
Executing actions. Remaining 0/250391 77:11:29.7s exec time total
Command: test. Finished 2031 local, 45445 remote, 51473 cache (52% hit) 20:16:36.9s exec time cached (26%)
Time elapsed: 7:32.7s
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D68516080
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145413
Approved by: https://github.com/fduwjj
- Added another workflow to run the mi300 jobs post-merge.
- Updated rocm.yml to use mi200s instead of mi300s.
- This is required to get an idea of how PRs are landing on our mi200s and mi300s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145398
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>