For #143486
Interestingly enough changing the indexing type seems to degrade performance when a larger width is not needed, even on small sizes, so making this a template param rather than forcing all cases to 64-bit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143696
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
In hinsight, we never needed a DICT_SUBCLASS_GUARD_MANAGER, because Dynamo would inline through the overridden keys method. In this PR, we ensure that while creating guards and constructing variable trackers, we get the `d.keys()` value by using `dict.keys(d)`. This ensures that we do not call overridden keys method. Therefore, the C++ guard can use `PyDict_Next` directly to check the guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143722
Approved by: https://github.com/jansel
Fixes#143071
Operations performed on tensors with `requires_grad=True` such as
```python
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x ** 3
```
and
```python
x = torch.tensor(2.0, requires_grad=True)
y = torch.pow(x,3)
```
are valid operations.
While an operation using `numpy` like
```python
import numpy as np
x = torch.tensor(2.0, requires_grad=True)
y = np.pow(x,3)
# > RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
```
leads to an error.
However, an operation that uses `math` like
```python
import math
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
```
does not cause an error, and `y` is no longer a tensor with a gradient!
This represents a [footgun](https://en.wiktionary.org/wiki/footgun#Noun) for some users, like myself when training small, custom, non-neural network models.
To prevent future undesired behavior, I added a warning when converting tensors with `requires_grad=True` to scalars. Now, when using `math.pow` on a `tensor`, we get a single warning with:
```python
x = torch.tensor(2.0, requires_grad=True)
y = math.pow(x,3)
# > UserWarning: Converting a tensor with requires_grad=True to a scalar may lead to unexpected behavior.
# Consider using tensor.detach() first.
```
Please let me know if you have any questions 👍
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143261
Approved by: https://github.com/albanD
Fixes#104899
Refactors AdamW into Adam by making AdamW a subclass of Adam. Additionally adds a test to assert that the added parameter `decoupled_weight_decay` is True in AdamW and also updates test_defaults_changed_to_foreach to account for the differences in module location for AdamW.
Heavily heavily inspired by #118857 by @tfsingh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143710
Approved by: https://github.com/janeyx99
Summary: Mostly cosmetic, but one bug fix:
* Bug fix: Make sure compile_id is converted to a string in the compilation metrics so it's printed as, e.g., "0/1" instead of "[0, 1]"
* Sort collections in `collection_to_str`
* Print non-string elements as `"<unknown>"` instead of None (since we don't expect non-strings)
* Move the population of the legacy metrics and any pre-processing to a new factory method in CompilationMetrics
Test Plan:
```
python test/dynamo/test_structured_trace.py
python test/dynamo/test_utils.py
```
Internal testing: https://fburl.com/scuba/dynamo_compile/sandbox/l0me8auf
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143332
Approved by: https://github.com/ppanchalia
In some situations we want to profile calls coming from all threads (similar to on-demand), not just the thread that started profiling and the spawned threads that would inherit KinetoThreadLocal state.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143659
Approved by: https://github.com/sraikund16
# Motivation
Fix https://github.com/pytorch/pytorch/issues/143543
# Solution
We should raise python exception instead of aborting...
# Additional Context
without this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
terminate called after throwing an instance of 'c10::Error'
what(): device is out of range, device is 2, total number of device is 2.
Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so)
frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so)
frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so)
<omitting python frames>
frame #20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6)
Aborted (core dumped)
```
with this PR:
```python
>>> import torch
>>> torch.accelerator.current_stream(torch.accelerator.device_count())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream
return torch._C._accelerator_getStream(device_index)
RuntimeError: The device index is out of range. It must be in [0, 2), but got 2.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143550
Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
Summary:
If module being quantized contains a some meta tensors and some tensors with actual device, we should not fail quantization.
Quantization should also not fail if new quantized module is created on a meta device.
Differential Revision: D66895899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142262
Approved by: https://github.com/iamzainhuda