One of the tests in this file was calling `self._logging.set_logs(output_code=True)`, which caused logs to be printed for the rest of the tests in this file.
This PR puts the log-setting in a context manager so that the old behavior is restored afterwards.
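The save-and-restore pattern can be sketched with the standard `logging` module. This is only an illustration: `temporary_log_level` is a hypothetical helper, not the code from this PR, which works with PyTorch's own logging configuration.

```python
import contextlib
import logging


@contextlib.contextmanager
def temporary_log_level(logger_name, level):
    """Set a logger's level inside the `with` block, then restore it."""
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the test body raises, so later tests are unaffected.
        logger.setLevel(previous)
```

A test would wrap only the statements that need the extra logging in `with temporary_log_level(...)`, leaving the global state untouched for the tests that follow.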
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145895
Approved by: https://github.com/nmacchioni
The original `_all_gather_keys` call was a safety check, but it can be costly as things scale, and it blocks the CPU.
Instead, we make it clear in the documentation that the `state_dict` passed to the `load` API should have the same set of keys, otherwise the API may hang.
In addition, we move the check to a utility function: `utils.assert_same_keys`. Users who are unsure whether their state dicts have the same keys can optionally call this API to check.
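The kind of comparison such a check performs can be sketched in plain Python. This is an assumption-laden illustration, not the implementation of `utils.assert_same_keys`: `find_key_mismatches` is a hypothetical function, and it takes key lists that have already been collected from each rank rather than gathering them itself.

```python
def find_key_mismatches(keys_per_rank):
    """Return {rank: (missing, extra)} for ranks whose keys differ from rank 0.

    keys_per_rank[i] is the list of state_dict keys seen on rank i.
    """
    reference = set(keys_per_rank[0])
    mismatches = {}
    for rank, keys in enumerate(keys_per_rank):
        keys = set(keys)
        if keys != reference:
            # missing: keys rank 0 has but this rank lacks; extra: the reverse.
            mismatches[rank] = (sorted(reference - keys), sorted(keys - reference))
    return mismatches
```

An empty result means every rank reported the same key set; any entry identifies a rank that could cause `load` to hang.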
Resolves #145965 (as a workaround).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145998
Approved by: https://github.com/mhorowitz, https://github.com/fegin
This enforces the invariant that every backend implements the same set of ops and removes a layer of indirection for BasicMathOps.
Interestingly, this is a small compile-time win:
```
...
WIN: benchmark ('add_loop_inductor', 'compile_time_instruction_count') failed, actual result 30151159301 is -6.13% lower than expected 32120000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('add_loop_inductor_dynamic_gpu', 'compile_time_instruction_count') pass, actual result 44447549162 -1.69% is within expected 45210000000 ±2.50%
WIN: benchmark ('add_loop_inductor_gpu', 'compile_time_instruction_count') failed, actual result 26743557195 is -2.25% lower than expected 27360000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
PASS: benchmark ('basic_modules_ListOfLinears_eager', 'compile_time_instruction_count') pass, actual result 945129734 +0.93% is within expected 936400000 ±1.50%
WIN: benchmark ('basic_modules_ListOfLinears_inductor', 'compile_time_instruction_count') failed, actual result 18984384503 is -3.19% lower than expected 19610000000 ±1.50% please update the expected results.
please update all results that changed significantly, and not only the failed ones
WIN: benchmark ('basic_modules_ListOfLinears_inductor_gpu_force_shape_pad', 'compile_time_instruction_count') failed, actual result 17258025389 is -1.94% lower than expected 17600000000 ±1.50% please update the expected results.
```
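The percentages in the log follow from a relative-difference calculation. A minimal sketch, assuming the check compares the relative difference against the stated noise margin; `classify` and the `REGRESSION` label are hypothetical names, not the benchmark harness's own code:

```python
def classify(actual, expected, noise_margin):
    """Compare an instruction count against its expected value.

    noise_margin is a fraction, e.g. 0.015 for a ±1.50% band.
    Returns (label, percent_difference).
    """
    ratio = (actual - expected) / expected
    if ratio < -noise_margin:
        label = "WIN"
    elif ratio > noise_margin:
        label = "REGRESSION"  # assumed label; it does not appear in the log above
    else:
        label = "PASS"
    return label, ratio * 100


# First entry from the log: add_loop_inductor
label, pct = classify(30151159301, 32120000000, 0.015)
print(label, f"{pct:.2f}%")  # prints: WIN -6.13%
```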
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146235
Approved by: https://github.com/shunting314
ghstack dependencies: #146225, #146226
A rewrite of #138964
In addition to rewriting the conditions for using copy2d, this PR fixes a few other problems with #138964:
1) gpu-gpu copies when peer access is disabled shouldn't rely on copy2d
2) copy2d should record even for the host pinned memory, like the regular copy does
3) copy2d shouldn't pretend that it's synchronizing (for the purposes of cuda sanitizer tracer) when it's non-blocking
In this PR copy2d behaves in exactly the same way as copy does with respect to those additional syncs, except that it makes a different underlying cuda call.
Tests are added for multiple cases that go through copy2d, and for cases that avoid the copy2d path because its conditions are not satisfied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146256
Approved by: https://github.com/eqy, https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixing the following issue when compiling the following program:
```
import torch

# N_FFT and HOP_LENGTH are constants defined elsewhere.
def forward(x):
    window = torch.hann_window(N_FFT).to(x.device)
    stft = torch.stft(
        x, N_FFT, HOP_LENGTH, window=window, return_complex=True
    )
    magnitudes = stft[..., :-1].abs() ** 2
    return magnitudes
```
```
Traceback (most recent call last):
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 57, in testPartExecutor
    yield
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 623, in run
    self._callTestMethod(testMethod)
  File "/home/zhxchen17/miniconda3/envs/dev/lib/python3.11/unittest/case.py", line 579, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/testing/_internal/common_utils.py", line 3120, in wrapper
    method(*args, **kwargs)
  File "/home/zhxchen17/pytorch/test/inductor/test_torchinductor.py", line 12356, in new_test
    return value(self)
           ^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor.py", line 4334, in test_stft
    self.check_model(model, example_inputs)
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 185, in check_model
    actual = AOTIRunnerUtil.run(
             ^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 137, in run
    optimized = AOTIRunnerUtil.load(device, so_path)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/test/inductor/test_aot_inductor_utils.py", line 119, in load
    return torch._export.aot_load(so_path, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zhxchen17/pytorch/torch/_export/__init__.py", line 165, in aot_load
    runner = torch._C._aoti.AOTIModelContainerRunnerCuda(so_path, 1, device)  # type: ignore[assignment, call-arg]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected extern kernel aten::hann_window to have serialized argument type as_scalar_type for argument 1 but got as_device
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146263
Approved by: https://github.com/angelayi
The goal of this PR is to provide 3 ways for people to try out the CUTLASS backend:
1. fbcode / internal
2. pip install torch (nightly) and pip install nvidia-cutlass
3. build from source
Below, I go into more detail on the combinations of building from source and installing via pip, for both torch and cutlass.
repro:
```
import torch
import torch.nn as nn
import torch._inductor.config as config
config.force_disable_caches = True
config.max_autotune = True
config.max_autotune_gemm_backends = "CUTLASS"
# the following is only needed if you use a custom cutlass library
# config.cuda.cutlass_dir = "/data/users/henrylhtsang/cutlass"
class TestModule(nn.Module):
    def forward(self, A, B):
        return A @ B
model = TestModule().cuda()
M, K, N = 2048, 2048, 2048
A = torch.randn(M, K).cuda().half()
B = torch.randn(K, N).cuda().half()
C = torch.compile(model, fullgraph=True)(A, B)
```
## pre-requisite
This assumes you have the right cuda toolkit installed; 12.4 is recommended. Make sure PATH, LD_LIBRARY_PATH and CUDA_NVCC_EXECUTABLE are set correctly.
## combo 1: pip install torch + pip install nvidia-cutlass
Check https://pytorch.org/get-started/locally/ for **nightly** install command.
```
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu124
pip install nvidia-cutlass
```
Then try running the script above. It should work.
## combo 2: build torch from source + pip install nvidia-cutlass
This is going to be pretty straightforward. Just keep in mind that even though pytorch/third_party/cutlass exists, the one that will be used is the pip package, so be mindful of version differences.
## combo 3: build torch from source + use pytorch/third_party/cutlass
This is how most pytorch devs would do it. Just make sure you don't have a cutlass pip package installed, i.e., make sure `import cutlass_library` would fail on its own.
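One way to check this without triggering an import error is to ask the import system whether the module can be found. A minimal sketch using only the standard library; `pip_cutlass_installed` is a hypothetical helper name, and it should be run in a fresh interpreter so that nothing else has modified `sys.path`:

```python
import importlib.util


def pip_cutlass_installed():
    """Return True if a `cutlass_library` module is importable in this environment."""
    return importlib.util.find_spec("cutlass_library") is not None


if pip_cutlass_installed():
    print("cutlass_library is importable; uninstall the pip package for this combo")
else:
    print("cutlass_library is not importable on its own")
```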
## combo 4: any torch version + cutlass library from somewhere else
This is probably the only case in which you need to pass in cutlass_dir. Just set cutlass_dir to the root of the cutlass repo. The expectation is that cutlass_dir is the directory that contains include, tool, and python/cutlass_library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145891
Approved by: https://github.com/Chillee, https://github.com/ColinPeppler
Requested in #77764
This PR adds support for linalg.det on MPS and fixes lu factor for non-contiguous tensors. The existing implementation crashed on any kind of non-contiguous tensor with this error:
```
-[AGXG13XFamilyCommandBuffer blitCommandEncoderCommon:]:833: failed assertion `A command encoder is already encoding to this command buffer'
zsh: abort python det.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146279
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
It's unused otherwise, and when running MPS tests, I get a bunch of warnings of this kind:
/Users/davidino/pytorch/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/model_container.h:412:10: warning: private field 'blob_size_' is not used [-Wunused-private-field]
412 | size_t blob_size_;
| ^
1 warning generated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146278
Approved by: https://github.com/Skylion007, https://github.com/jansel