Unclear if there is a more efficient way to define the allowed types for IR (or whether we even need this; perhaps we just ditch the assert?), but Inductor experts can determine whether these added ops are appropriate, and if so, they fix the reported issue.
Fixes #96204
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96221
Approved by: https://github.com/ezyang
test-infra's linux_job uses github.ref as the default value for the ref, which is the branch, so it checks out the most recent commit on the branch.
Might be better to fix this on the test-infra side instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96317
Approved by: https://github.com/huydhn
expecttest is not imported into the OSS BUCK build yet. Using it in the target test_torchgen_executorch breaks the build.
Remove it first to fix the build. It will be imported and the test fixed in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96314
Approved by: https://github.com/huydhn
Summary: ciflow/inductor-perf-test-nightly now contains the full dashboard
run, which takes a very long time. Ed proposed a simplification of the
perf run there, but it is still worthwhile to have a set of fast perf tests
which only includes one configuration (--training --amp).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96166
Approved by: https://github.com/huydhn, https://github.com/weiwangmeta
# Summary
This PR adds an optional kwarg to torch.nn.functional.scaled_dot_product_attention().
The new kwarg is a scaling factor that is applied after the q@k.T step of the computation. The efficient kernel was updated to support it, and the flash and math kernels were minimally updated to support it as well.
This will reduce the complexity of #94729 and has been asked for by a couple of users.
# Review Highlights
- As far as I know I did this the correct way and it is both BC and FC compliant. However, I always seem to break internal workloads, so I would love it if someone could confirm I did this right.
- I named the optional arg 'scale'. It may be a poor name and 'scale_factor' might be better; I will make that change if reviewers think we should rename it, though it will be a bit annoying.
- 'scale' is interpreted as `Q@K.T * (scale)`
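A minimal usage sketch of the new kwarg (assuming it lands under the name `scale` as described above; passing the default value explicitly should match the unscaled call):
```
import math
import torch
import torch.nn.functional as F

q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# By default Q@K.T is scaled by 1/sqrt(head_dim); `scale` overrides that factor
# and is applied as Q@K.T * scale before the softmax.
out_default = F.scaled_dot_product_attention(q, k, v)
out_custom = F.scaled_dot_product_attention(q, k, v, scale=1.0 / math.sqrt(64))
torch.testing.assert_close(out_default, out_custom)
```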
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95259
Approved by: https://github.com/cpuhrsch
Summary: This commit adds a test for mixing multiple dtypes
for different layers in the same model. The test verifies that
FX graph mode quantization converts the dtypes correctly
between the layers.
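A minimal sketch of what such a test can look like (not the actual test added here; the model and layer names are made up), using FX graph mode quantization with different qconfigs per layer:
```
import torch
from torch.ao.quantization import QConfigMapping, get_default_qconfig
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

class TwoLayer(torch.nn.Module):  # hypothetical model, not from this commit
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(4, 4)
        self.fc2 = torch.nn.Linear(4, 4)

    def forward(self, x):
        return self.fc2(self.fc1(x))

# fc1 is quantized to quint8, fc2 stays in float32; convert_fx must insert the
# dtype conversion (dequantize) between the two layers.
qconfig_mapping = (
    QConfigMapping()
    .set_module_name("fc1", get_default_qconfig("fbgemm"))
    .set_module_name("fc2", None)
)
example_inputs = (torch.randn(1, 4),)
model = prepare_fx(TwoLayer().eval(), qconfig_mapping, example_inputs)
model(*example_inputs)  # calibration
model = convert_fx(model)
model(*example_inputs)
```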
Test Plan:
python test/test_quantization.py TestQuantizeFx.test_mixed_dtypes
Reviewers: jcaip, vkuzo, supriyar
Subscribers: jcaip, vkuzo, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96104
Approved by: https://github.com/jcaip
This makes the next PR in the stack cleaner: the top-level entry point to AOT Autograd performs the functionalization analysis pass once and plumbs the metadata everywhere else that we need it.
I put it in a separate PR because I recently learned that this function is used in fbcode, so I'll need to fix up internals when I land this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95991
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/95167
More details are in that issue. To summarize, the issue shows up when we have some code like this:
```
def f(x):
    x.detach().mul_(2)  # can also happen if the mul_() happens under torch.no_grad()
    return x + 1
```
AOTAutograd will then spit out code like this:
```
def compiled_fn(x):
    x_updated = x.mul(2)
    out = x_updated + 1
    return x_updated, out

def CompiledFunction.forward(x):  # pseudocode, this is part of an autograd.Function
    x_updated, out = compiled_fn(x)
    return x_updated, out

def runtime_wrapper(x):
    x_updated, out = CompiledFunction.apply(x)
    x.copy_(x_updated)
    return out

x = torch.ones(2, requires_grad=True)
out = runtime_wrapper(x)
```
However, the call to `x.copy_(x_updated)` will fail with the error: `a leaf Variable that requires grad is being used in an in-place operation`. This is because `x` is an autograd leaf, and autograd doesn't allow you to mutate leaves.
In this case though, the data mutation should be entirely opaque to autograd - all mutations happened underneath a `.detach()` or a `torch.no_grad()`.
As Ed pointed out in the issue, we can detect this situation by checking if the mutated input is an autograd leaf. If it is, then any mutations on it must have been hidden from autograd, since otherwise the eager code would have errored. The solution I added is to detect this situation and manually run `x.detach().copy_(x_updated)` to hide the update from autograd.
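A standalone sketch of the workaround, simplified from the actual runtime wrapper logic:
```
import torch

x = torch.ones(2, requires_grad=True)   # autograd leaf
x_updated = x.detach().mul(2)           # stand-in for the compiled graph's output

# x.copy_(x_updated) would raise:
#   RuntimeError: a leaf Variable that requires grad is being used in an in-place operation
# Detaching first performs the same data mutation while keeping it hidden from autograd,
# matching the eager semantics where the mutation happened under .detach() / torch.no_grad().
x.detach().copy_(x_updated)
print(x)  # tensor([2., 2.], requires_grad=True)
```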
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95980
Approved by: https://github.com/ezyang
Previously, if dynamic shapes were turned on and we had a forward graph that returned a symint, then we would generate a backward graph that takes in a tangent input for that symint forward output. This causes problems downstream: Inductor will see an input that it expects to be a symint, but it gets a `None` from autograd.
Confirmed that this repro now passes:
```
benchmarks/dynamo/torchbench.py --devices cuda --inductor --dynamic-shapes --unspecialize-int --accuracy --training --only drq
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96219
Approved by: https://github.com/ezyang
Summary: The original code uses a class variable to store the flat_parameter result. This could cause a memory leak.
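A generic illustration of the failure mode (not the actual FSDP code; the names below are made up): a result cached on the class outlives every instance, while an instance attribute is released together with its instance.
```
import torch

class Handle:
    _flat_param_cache = None  # class variable: shared by all instances, never released

    def build(self):
        Handle._flat_param_cache = torch.empty(1_000_000)  # stays alive after the handle is gone
        return Handle._flat_param_cache

class HandleFixed:
    def __init__(self):
        self._flat_param_cache = None  # instance variable: freed when the handle is garbage collected

    def build(self):
        self._flat_param_cache = torch.empty(1_000_000)
        return self._flat_param_cache
```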
Test Plan: CI and an E2E run
Reviewed By: awgu
Differential Revision: D43893577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96263
Approved by: https://github.com/zhaojuanmao
Summary:
## Summary
torch.nn.functional.pixel_unshuffle and torch.narrow accept both float
and quantized inputs. However, previously we would unnecessarily
dequantize quantized inputs into floats before passing them to
these functions. This commit fixes this by lowering the patterns
[dequant - pixel_unshuffle - quant] and
[dequant - narrow - quant].
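A small sketch of the point above (not the test added in this commit): both ops can consume a quantized tensor directly, so the dequantize/quantize round trip is unnecessary.
```
import torch

x = torch.rand(1, 1, 4, 4)
qx = torch.quantize_per_tensor(x, scale=0.1, zero_point=0, dtype=torch.quint8)

# pixel_unshuffle works directly on the quantized input; the output stays quantized.
out = torch.nn.functional.pixel_unshuffle(qx, downscale_factor=2)
print(out.dtype)  # torch.quint8

# narrow is a view op and likewise keeps the quantized dtype.
print(qx.narrow(2, 0, 2).dtype)  # torch.quint8
```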
Test Plan:
```
python test/test_quantization.py TestQuantizeFxOps.test_pixel_unshuffle
```
```
python test/test_quantization.py TestQuantizeFxOps.test_narrow
```
Differential Revision: D43858199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96160
Approved by: https://github.com/andrewor14
sccache added the GH cache as a storage option, so try to use it for the GH-provided mac runners.
My results with this varied. I tried a couple of different releases, and the first run with a cold cache took 1hr (v0.3.3), 1hr (v0.4.0 pre7), and 2hr (v0.3.3).
Afterwards it usually takes 30 minutes, sometimes longer, but no longer than 1hr.
I am using v0.4.0 pre7 because it reduced the amount of configuration/env vars you need to set, and the GH cache keys are managed by sccache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96142
Approved by: https://github.com/huydhn, https://github.com/malfet
Summary:
In ATen mode, we add the RuntimeContext arg, so we have something like
```
TORCH_API inline at::Tensor & gelu_outf(torch::executor::RuntimeContext & context, const at::Tensor & self, c10::string_view approximate, at::Tensor & out) {
return at::gelu_outf(self, approximate, out);
}
```
and users can use `<namespace like aten>::gelu_outf`, and we will automatically dispatch to the registered ATen kernel using `at::gelu_outf` (dispatched by the ATen/Functions.h header).
In optimized kernel tests, we can now automatically switch between the ATen kernel and the optimized kernel.
The implication is that the test must depend on the correctness of codegen; an error in codegen can break the kernel tests.
Test Plan: CI
Differential Revision: D43777848
Pull Request resolved: https://github.com/pytorch/pytorch/pull/96084
Approved by: https://github.com/larryliu0820