Enables additional inductor UTs on ROCm and unskips tests whose skips are now outdated.
I have also removed a group of expected failures in `test_torchinductor_opinfo` which now pass on both CUDA and ROCm:
```
- # The following 3 tests fail on CUDA with AssertionError: expected size 5==5, stride 5==1 at dim=0
- # linalg._svd's return value has different strides on CUDA vs CPU which causes this
- # In test_meta.py there is a mechanism to skipping strides checks for some ops
- # (including _linalg_svd), possibly we should have something similar here
- "linalg.cond": {f32, f64},
- "linalg.svdvals": {f32, f64},
- "linalg.matrix_rank": {f32, f64},
- "linalg.svd": {f32, f64},
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104624
Approved by: https://github.com/malfet
This makes it easier to exclude multi-line messages using single-line grepping: since every line now carries the log prefix, e.g. `grep -v __guards` filters out all guard output, including continuation lines. If your screen is wide enough, the extra prefix width should not be a big problem.
Example of what it looks like:
```
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] GUARDS:
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] hasattr(L['x'], '_dynamo_dynamic_indices') == False
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] ___is_grad_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] not ___are_deterministic_algorithms_enabled()
[2023-07-10 20:11:30,529] torch._dynamo.convert_frame.__guards: [DEBUG] utils_device.CURRENT_DEVICE == None
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104932
Approved by: https://github.com/mlazos, https://github.com/albanD
Currently, negative unspecified ints get specialized. This PR creates symbolic values for
unspecified ints (including negative ones).
For example, with this PR, the following code only compiles once, instead of 3 times:
```python
def foo(x, y):
    return torch.fill(torch.zeros(x.shape), y)

x = torch.randn(3)
foo(x, 10)
foo(x, -5)
foo(x, -3)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104658
Approved by: https://github.com/ezyang
Summary:
QAT convert for mobilenetv2 was previously broken because we
incorrectly applied dropout during eval as well as during training.
This is because, for exported models, `model.eval()` does not change
the behavior of dropout, unlike models built from torch ops.
This commit simulates the effect of `model.eval()` for exported
models as well by replacing the aten dropout pattern before eval.
As of this commit, end-to-end QAT numerics now match for
mobilenetv2 between FX and PT2.
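A minimal sketch of the kind of rewrite involved, using `torch.fx.subgraph_rewriter` on an exported `GraphModule` (the fixed `p=0.5` and the helper name are illustrative assumptions, not necessarily the exact pattern the commit matches):
```python
import torch
from torch.fx import subgraph_rewriter

def _dropout_train(x):
    # pattern: aten dropout in training mode, as left in the exported graph
    return torch.ops.aten.dropout.default(x, 0.5, True)

def _dropout_eval(x):
    # replacement: the same op with train=False, i.e. a no-op at inference
    return torch.ops.aten.dropout.default(x, 0.5, False)

def simulate_eval(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    subgraph_rewriter.replace_pattern(gm, _dropout_train, _dropout_eval)
    gm.recompile()
    return gm
```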
Test Plan: python test/test_quantization.py TestQuantizePT2EModels.test_qat_mobilenet_v2
Differential Revision: D46750343
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104110
Approved by: https://github.com/jerryzh168
This PR disables translation validation (TV) when running the benchmark suites on
performance workflows: inductor with A100s.
In summary, the changes are:
- Add a flag for turning TV on and off in _benchmarks/dynamo/common.py_ (sketched below)
- Turn TV on only on CI accuracy builds
- Add `--no-translation-validation` target flag to _.ci/pytorch/test.sh_
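A hedged sketch of what such a flag could look like in _benchmarks/dynamo/common.py_ (the exact wiring in the benchmark runner may differ, and the config attribute is an assumption about how TV is toggled):
```python
import argparse

import torch._dynamo.config as dynamo_config

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-translation-validation",
    action="store_true",
    help="Disable translation validation of dynamic-shape reasoning.",
)
args = parser.parse_args()

if args.no_translation_validation:
    # assumption: TV is controlled by this Dynamo config flag
    dynamo_config.translation_validation = False
```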
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104887
Approved by: https://github.com/ezyang
Context
-------
This PR adds a new fallback to the Autograd dispatch keys.
If you would prefer the old behavior:
- A quick (unsupported) way to get the previous behavior is to call
`torch._C._set_autograd_fallback("nothing")`
- Register `torch::CppFunction::makeFallthrough()` to your Autograd key,
as in https://gist.github.com/zou3519/d09a5f4b1afe2430af09fea67c6ff2c8 (a Python sketch follows below)
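A hedged Python sketch of the same idea, assuming your build exposes `torch.library.fallthrough_kernel` (the `myops::myop` operator is hypothetical):
```python
import torch

# Restore the old behavior for one custom op by registering a fallthrough
# for its Autograd key. "myops::myop" is a hypothetical, already-defined op.
lib = torch.library.Library("myops", "IMPL")
lib.impl("myop", torch.library.fallthrough_kernel, "Autograd")
```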
It is possible that this PR regresses performance of overhead-bound
models. If this is the case, please reach out (and apply one of the
temporary fixes in the previous section).
Description for reviewers
-------------------------
In order to deprecate registering autograd kernels at a key that is not an
Autograd key, we add a fallback to the Autograd dispatch keys. This fallback
raises a warning if the user attempts to backprop through the operator,
and it is configurable to either warn or not warn.
The goals of this PR are to:
- preserve as much BC as possible
- raise a warning that whatever the user is doing is potentially wrong
- be as performant as possible
There are roughly two cases:
- if the post-autograd kernels return a Tensor that requires grad, then
we install an autograd hook that raises a warning. We are preserving BC
in that it is possible that the user has a torch::autograd::Function
registered to their CPU key.
- if the post-autograd kernels return Tensors that do not require grad,
then we make them require grad and install a WarnNotImplemented grad fn
that warns in the backward pass (sketched below). This is mildly BC-breaking (see next
section).
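A conceptual Python sketch of that second case (the real fallback is implemented in C++ inside the dispatcher; this only illustrates the warn-on-backward behavior):
```python
import warnings

import torch

class WarnNotImplemented(torch.autograd.Function):
    """Identity op whose only job is to warn when backprop reaches it."""

    @staticmethod
    def forward(ctx, x):
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        warnings.warn(
            "This operator has no kernel registered to an Autograd key; "
            "its gradients may be incorrect."
        )
        return grad_output

# conceptually, the fallback rewraps each floating-point/complex output:
#   out = WarnNotImplemented.apply(out)
# which also makes the output require grad, matching the BC note below.
```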
Test Plan:
- a bunch of new tests
BC-Breaking Note
----------------
This PR adds a new fallback to the Autograd dispatch keys. It affects
custom operators that do not have a kernel registered to the Autograd
keys (e.g. AutogradCPU and AutogradCUDA).
If the previous behavior was that the custom operator would return
Tensors that do not require grad if the inputs do require grad, then
this PR changes it so that all floating-point and complex returns do
require grad. See the "Context" section above for how to get the old
behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104481
Approved by: https://github.com/soulitzer
This reduces the total number of modules imported by default from 1419 to 1322, according to
```
time python -c "import sys;before=len(sys.modules);import torch;after=len(sys.modules);print(f'torch-{torch.__version__} imported {after-before} modules')"
```
and slightly reduces import time, while having no effect on UX (i.e. the `torch.onnx` submodule is kept intact).
Also suppress lint errors that appeared after mypy accidentally started listing more files; for more details see https://github.com/pytorch/pytorch/issues/104940.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104843
Approved by: https://github.com/jansel, https://github.com/albanD
This is a bug discovered by https://github.com/pytorch/pytorch/pull/104810. Basically, when the PR body is empty, the GitHub API returns a None value, which is passed into `parse_reenabled_issues`, causing it to fail.
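A hedged sketch of the kind of guard that fixes this; the simplified `parse_reenabled_issues` body below is a stand-in for the real helper in `filter_test_configs.py`:
```python
import re
from typing import List, Optional

def parse_reenabled_issues(pr_body: str) -> List[str]:
    # simplified stand-in for the real parsing logic
    return re.findall(r"#(\d+)", pr_body)

def get_reenabled_issues(pr_body: Optional[str]) -> List[str]:
    # The GitHub API returns None for an empty PR body, so normalize it
    # to an empty string before parsing
    return parse_reenabled_issues(pr_body or "")
```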
### Testing
```
python3 .github/scripts/filter_test_configs.py \
--workflow "pull" \
--job-name "linux-focal-py3-clang7-android-ndk-r19c-gradle-custom-build-single-full-jit / filter," \
--test-matrix "{ include: [ { config: 'default', shard: 1, num_shards: 1, runner: 'linux.2xlarge' }, ]}" \
--pr-number "104810" \
--tag "" \
--event-name "pull_request" \
--schedule "" \
--branch ""
```
The command now works correctly without failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104914
Approved by: https://github.com/clee2000
## Problem
Trying to support numpy function calls in dynamo, with a numpy dtype as an argument.
For example:
```python
def fn(x: int):
    return np.empty_like(x, dtype=np.float64)
```
## Solution
This currently doesn't work because `NumpyVariable` doesn't implement `as_proxy()`. The idea in `as_proxy()` for now is to convert `np.float64` and other `np.<dtype>` values into the corresponding `torch.dtype` and then feed it into the corresponding `torch_np` method.
For the previous example, we convert `np.float64` to `torch.float64` in `as_proxy()` and then feed it into the `torch_np.empty_like()` method.
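A minimal sketch of the dtype mapping this implies (the table below is illustrative, not the exact one used by `as_proxy()`):
```python
import numpy as np
import torch

# illustrative numpy-dtype -> torch-dtype table
_NP_TO_TORCH_DTYPE = {
    np.float32: torch.float32,
    np.float64: torch.float64,
    np.int32: torch.int32,
    np.int64: torch.int64,
}

def _to_torch_dtype(value):
    # leave non-dtype arguments untouched
    return _NP_TO_TORCH_DTYPE.get(value, value)
```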
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103546
Approved by: https://github.com/ezyang
## Context prior to this PR
https://github.com/pytorch/pytorch/pull/100017/ was merged onto the PyTorch `main` branch with the goal of enabling `torch._dynamo.export` to perform symbolic tracing.
In that context, symbolic tracing is defined as tracing a model using fake inputs and weights. An input is fake when a `torch.Tensor` is replaced by a `torch._subclasses.FakeTensor`, whereas a weight is fake when a `torch.nn.Parameter` is replaced by a `torch._subclasses.FakeTensor`.
For additional context, several strategies were discussed with Meta to enable this feature, including 1) calling `torch._dynamo.export` within a `torch._subclasses.FakeTensorMode` context and 2) **fake**fying the input and model as a separate step and then calling `torch._dynamo.export` without an active `torch._subclasses.FakeTensorMode` context. In the end, 2) was preferred and implemented by #100017 to minimize the number of side effects the fake tensor mode has on the code base.
As a consequence, the `torch._dynamo.export` API introduced a new argument called `fake_mode`. When symbolic tracing is used, the user must pass in the `fake_mode` used to fakefy both the input and the model. Internally, `torch._dynamo.export` will adopt this `fake_mode` instead of creating its own instance. This is needed because each instance of `FakeTensorMode` holds metadata on the tensors/parameters it fakefied. Thus, using a real tensor/model while specifying a `fake_mode` to `torch._dynamo.export` is an error; likewise, specifying a `fake_mode` instance different from the one used to fakefy the model and input is also an error.
## Changes introduced by this PR
This PR is intended to integrate `torch._dynamo.export(fake_mode=...)` into `torch.onnx.dynamo_export`. In essence, it:
* Introduces a new public API `ONNXFakeContext` which wraps a `FakeTensorMode` under the hood. This removes complexity from the user side while still allowing the exporter to leverage the fake mode.
* Adds a new public API `enable_fake_mode` *context manager* that instantiates and returns an `ONNXFakeContext`.
* Adds a new `ExportOptions.fake_context` that will be used to persist the `ONNXFakeContext` created by `enable_fake_mode` and plumb it through until it reaches the call to `torch._dynamo.export`.
* Adds a `model_state_dict` argument to the `ExportOutput.save` API.
  * When the model is exported with fake tensors, no actual data exists in the FX module nor, therefore, in the ONNX graph.
  * In fact, `torch.fx.make_fx` lifts initializers as model inputs when fake tensors are used.
  * https://github.com/pytorch/pytorch/pull/104493 is needed to enforce name matching between Parameters and inputs.
  * A model checkpoint file or state_dict is needed to populate the ONNX graph with real initializers through the `export_output.save(model_state_dict=...)` API.
Symbolic tracing, or onnx fake mode, is only enabled when the user instantiates the input and model within the `enable_fake_mode` context. Otherwise, real tracing is done, which preserves the current behavior.
## Usability
Because symbolic tracing depends a lot on having changes made on the Dynamo side before they can be consumed on the ONNX exporter side, this feature may have its API and assumptions changed as symbolic tracing matures upstream. Nonetheless, it is still important to have this feature merged ASAP on the ONNX exporter side to "lock" changes on Dynamo that would otherwise break the ONNX exporter without warning.
Example:
```python
class Model(torch.nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.linear = torch.nn.Linear(2, 2)

    def forward(self, x):
        out = self.linear(x)
        return out

with torch.onnx.enable_fake_mode() as fake_context:
    x = torch.rand(5, 2, 2)
    model = Model()

# Export the model with fake inputs and parameters
export_options = ExportOptions(fake_context=fake_context)
export_output = torch.onnx.dynamo_export(
    model, x, export_options=export_options
)

model_state_dict = Model().state_dict()  # optional
export_output.save("/path/to/model.onnx", model_state_dict=model_state_dict)
```
## Next steps
* Add unit tests running the exported model with ORT
Today this is not possible yet because `make_fx`, used by our Decomposition pass, lifts initializers as model inputs. However, the initializer names are not preserved by FX tracing, causing a mismatch between the initializer and input names.
https://github.com/pytorch/pytorch/pull/104493 and https://github.com/pytorch/pytorch/pull/104741 should fix the initializer mismatch, enabling model execution.
* Revisit `ONNXTorchPatcher` and how the ONNX initializers are saved in the graph as external data
We can try to get rid of the PyTorch patcher. If we can't, we might prefer to create specific patchers, say an `FXSymbolicTracePatcher` used specifically during an export using `torch.fx.symbolic_trace`, and maybe an `ExportOutputSavePatcher` used specifically for `ExportOutput.save`, to prevent patching more PyTorch APIs than we need.
## References
* [FakeTensor implementation](https://github.com/pytorch/pytorch/blob/main/torch/_subclasses/fake_tensor.py)
* [PR that adds fake tensor support to torch._dynamo.export](https://github.com/pytorch/pytorch/pull/100017)
* [Short fake tensor documentation](https://pytorch.org/torchdistx/latest/fake_tensor.html)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103865
Approved by: https://github.com/BowenBao
For quantize:
```
for (; i < len / VLEN * VLEN; i += VLEN) {
  __m512 x_vals = _mm512_load_ps(src + i);
  __m512 x_transformed_v = _mm512_mul_ps(x_vals, inverse_scale_v);
  x_transformed_v =
      _mm512_min_ps(x_transformed_v, _mm512_set1_ps(int32_float_max_val));
  __m512i x_rounded_v = _mm512_cvtps_epi32(x_transformed_v);
  x_rounded_v = _mm512_add_epi32(x_rounded_v, _mm512_set1_epi32(zero_point));
  __m512i x_clipped_v =
      _mm512_max_epi32(min_v, _mm512_min_epi32(max_v, x_rounded_v));
  x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
  x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
  _mm_storeu_si128(
      reinterpret_cast<__m128i*>(dst + i),
      _mm512_castsi512_si128(x_clipped_v));
}
```
```
x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
```
is aiming to cast `int32` to `int8` and shuffle the 16 `int8` values into the first 128 bits.
For example, letting a symbol such as `A1` represent 8 bits:
```
x_clipped_v = _mm512_shuffle_epi8(x_clipped_v, shuffle_mask_v);
A1A2A3**A4** B1B2B3**B4** C1C2C3**C4** D1D2D3**D4** -> D4C4B4A4 other 3 * 32 bits
E1E2E3**E4** F1F2F3**F4** G1G2G3**G4** H1H2H3**H4** -> H4G4F4E4 other 3 * 32 bits
I1I2I3**I4** J1J2J3**J4** K1K2K3**K4** L1L2L3**L4** -> L4K4J4I4 other 3 * 32 bits
M1M2M3**M4** N1N2N3**N4** O1O2O3**O4** P1P2P3**P4** -> P4O4N4M4 other 3 * 32 bits

x_clipped_v = _mm512_permutexvar_epi32(permute_mask_l8_v, x_clipped_v);
D4C4B4A4 other 3 * 32 bits -> D4C4B4A4 H4G4F4E4 L4K4J4I4 P4O4N4M4
H4G4F4E4 other 3 * 32 bits    other 3 * 4 * 32 bits
L4K4J4I4 other 3 * 32 bits
P4O4N4M4 other 3 * 32 bits
```
Based on the Intel intrinsics guide pseudocode for `_mm512_permutexvar_epi32` (https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_permutexvar_epi32&ig_expand=4966,5088):
```
FOR j := 0 to 15
i := j*32
id := idx[i+3:i]*32
dst[i+31:i] := a[id+31:id]
ENDFOR
dst[MAX:512] := 0
```
the `permute_mask_l8_v` should satisfy
```
permute_mask_l8_v[3:0] = 0
permute_mask_l8_v[3 + 32:0 + 32] = 4
permute_mask_l8_v[3 + 64:0 + 64] = 8
permute_mask_l8_v[3 + 96:0 + 96] = 12
```
The other bits of `permute_mask_l8_v` do not matter.
The `AVX2` version is correct.
This bug was not exposed before because this code path was only called with a fixed length of `64`: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_qint.h#L545-L546.
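To make the intended result concrete, here is a hedged Python model of what the shuffle + permute should accomplish for one 64-byte vector, assuming little-endian `int32` lanes:
```python
def pack_int32_low_bytes(vec: bytes) -> bytes:
    # Keep the low byte of each of the 16 int32 lanes and pack the 16
    # resulting int8 values into the first 128 bits (16 bytes).
    assert len(vec) == 64
    return bytes(vec[4 * lane] for lane in range(16))
```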
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104400
Approved by: https://github.com/jgong5, https://github.com/mingfeima, https://github.com/jerryzh168
The current TCPStore wait logic leaves the client socket in a bad state if the wait times out.
This happens because all recv functions raise an exception on timeout, and that's it.
The problem is that, on timeout, we need to unregister the wait.
We implement this with client-side cancellation by adding a new CANCEL_WAIT instruction.
If no data arrives before the deadline, the client sends a CANCEL_WAIT command.
The server always sends a WAIT_CANCELED response to that command.
This gets us down to the last issue, which is that there's a race between timing out,
canceling the wait, and the wait completing. The client needs to handle the server sending
a STOP_WAITING followed by a WAIT_CANCELED answer.
This ensures client and server state are synchronized regardless of whether the wait
times out or not.
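A hedged pseudocode sketch of the client-side protocol described above; `send_msg`, `recv_msg`, and `poll` are hypothetical stand-ins for the real TCPStore socket plumbing:
```python
from enum import Enum, auto

class Msg(Enum):
    WAIT = auto()
    STOP_WAITING = auto()
    CANCEL_WAIT = auto()
    WAIT_CANCELED = auto()

def wait(sock, keys, timeout):
    # send_msg/recv_msg/poll are hypothetical helpers, not real APIs
    send_msg(sock, Msg.WAIT, keys)
    if poll(sock, timeout):  # data arrived before the deadline
        assert recv_msg(sock) == Msg.STOP_WAITING
        return
    # timed out: ask the server to unregister the wait
    send_msg(sock, Msg.CANCEL_WAIT)
    reply = recv_msg(sock)
    if reply == Msg.STOP_WAITING:  # race: the wait completed while we canceled
        reply = recv_msg(sock)  # the server still always sends WAIT_CANCELED
    assert reply == Msg.WAIT_CANCELED
    raise TimeoutError("TCPStore wait timed out")
```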
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100594
Approved by: https://github.com/H-Huang
Python's `mod` semantics are not the same as the mathematical modulus operation. According to
the Python reference: `a = floor(a / b) * b + a % b`.
In other words: `a % b = a - floor(a / b) * b`.
This PR fixes the old implementation, which used SMT-LIB2 semantics for `mod`. In short, that
semantics only worked with integers and had the following guarantee: `0 <= a % b < b`.
In summary, the changes are:
- `a % b = a - floordiv(a, b) * b`
- `a` and `b` can both be integer or real
- The result will be real if either argument is real; otherwise, it will be integer
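The new definition matches Python's built-in `%`, which can be checked directly:
```python
import math

def pymod(a, b):
    # a % b = a - floor(a / b) * b
    return a - math.floor(a / b) * b

assert pymod(-7, 3) == -7 % 3 == 2      # negative dividend
assert pymod(7, -3) == 7 % -3 == -2     # negative divisor
assert pymod(7.5, 2) == 7.5 % 2 == 1.5  # real operands
```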
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104827
Approved by: https://github.com/lezcano
Originally, we didn't enable BWD for colwise embedding because we thought it was only needed for inference, but it turns out that we do need it for training. So, let's enable it for now; a unit test is also added.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104820
Approved by: https://github.com/fegin