Co-authored-by: Ashkan <yousefpour@fb.com>
Adds "same" and "valid" padding support, as Opacus (well @ashkan-software) did https://github.com/pytorch/opacus/pull/451
Basics of it are this:
- During the forward pass, if there's "same" padding, we manually pad the input (NB: this will cause a small perf hit; not benchmarked yet).
- During the backward pass, the gradient w.r.t. the input needs to be cut down to the correct size if the original padding was "same" (conv_transpose doesn't accept string padding). Because conv_transpose gives us a gradient w.r.t. the padded shape, we narrow the gradient to the correct size (we know how much padding we added on the left and right).
- Then, for the per-sample gradients w.r.t. the weights, the input is already padded, so neither the unfold nor the group convolution needs any padding (see the sketch after this list).
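A minimal sketch of the forward/backward logic, assuming stride 1 and a 1d convolution; `conv1d_same` is a hypothetical helper for illustration, not the PR's code:
```python
import torch
import torch.nn.functional as F

def conv1d_same(x, weight, dilation=1):
    # "same" padding (stride 1): total padding so output length == input length.
    kernel_size = weight.shape[-1]
    total_pad = dilation * (kernel_size - 1)
    left = total_pad // 2
    right = total_pad - left  # the right side absorbs the odd element
    x_padded = F.pad(x, (left, right))
    # The input is already padded, so the conv itself uses padding=0.
    out = F.conv1d(x_padded, weight, padding=0, dilation=dilation)
    # Backward intuition: a gradient computed w.r.t. x_padded has the padded
    # length; narrowing it recovers the gradient w.r.t. the original input:
    #   grad_x = grad_x_padded[..., left : left + x.shape[-1]]
    return out

x = torch.randn(2, 3, 10)
w = torch.randn(4, 3, 3)
assert conv1d_same(x, w).shape[-1] == x.shape[-1]  # "same" output length
```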
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83345
Approved by: https://github.com/zou3519
This PR updates the `tools/onnx/update_default_opset_version.py` script to ensure files are edited correctly, in preparation for opset 17 support in torch.onnx.
- (clean-up) Move the script logic into `main()`
- Add a `--skip_build` option to avoid rebuilding PyTorch when rerunning the process after post-compilation errors
- Update the script to edit the correct files now that the ONNX files have been refactored
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83283
Approved by: https://github.com/thiagocrepaldi, https://github.com/AllenTiTaiWang, https://github.com/abock
# Summary
This PR pulls out all the changes from #81838 specific to properly creating nested_tensor views. I will update this comment with a design doc once it has been written. This should enable proper creation of NestedTensor views: two nested tensors sharing the same buffer_ but with different NestedTensor metadata.
The function `create_nested_tensor_view` is a helper for creating a new nested tensor whose storage aliases the base, causing the underlying storage to be shared; the result is therefore a view.
This function is not itself differentiable, so autograd does not track its uses. If a nested tensor function uses this helper in its implementation, the aten op must meet two requirements (a small illustration of the view semantics follows the list):
- The function must return a view of the input
- The function must explicitly define its backward
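As an illustration of the view semantics described above, using plain dense tensors and public APIs rather than the internal helper:
```python
import torch

base = torch.arange(12.)
view = base.as_strided((3, 4), (4, 1))     # same storage, different metadata
assert view.data_ptr() == base.data_ptr()  # the buffer is shared
view[0, 0] = 100.0
assert base[0] == 100.0  # writes through the view are visible in the base
```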
## Testing
A bug was found when creating a base tensor outside of inference mode and then creating a view of it in inference mode. A test has been added to this PR to show the effect of the change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82658
Approved by: https://github.com/albanD
This adds vectorization to the copy kernel acting between different
dtypes through the use of `at::vec::convert`. Currently `vec::convert`
falls back to a scalar copy loop for most dtypes; however, the compiler
is still better able to auto-vectorize the loop since it doesn't
involve stride calculations.
In a simple timeit benchmark I see around a 2x speedup copying from
int32 to various dtypes:
| To dtype | Master (us) | This PR (us) |
|----------|-------------|--------------|
| int64 | 23.8 | 10.3 |
| float32 | 16.8 | 8.18 |
| float64 | 18.0 | 9.47 |
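A minimal sketch of such a benchmark; the tensor size and the use of `torch.utils.benchmark.Timer` are assumptions for illustration, not the PR's exact script:
```python
import torch
from torch.utils.benchmark import Timer

src = torch.randint(0, 100, (1_000_000,), dtype=torch.int32)
for dtype in (torch.int64, torch.float32, torch.float64):
    # Time the cross-dtype copy src (int32) -> dtype.
    t = Timer("src.to(dtype)", globals={"src": src, "dtype": dtype})
    print(dtype, t.blocked_autorange())
```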
Pull Request resolved: https://github.com/pytorch/pytorch/pull/80905
Approved by: https://github.com/ngimel
`lcm` is meant to use integer maths, but the use of `true_divide`
introduces a promotion to float and thus a loss of precision.
This also promotes low-precision integers to int32, which
is required for 100% consistency with the C++ implementation, since the
"usual arithmetic conversions" mean the intermediate terms are
calculated at `int` precision in C++. This only really matters when the
lower-precision dtype would overflow; however, the test cases for lcm
do involve overflows.
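A small illustration of the precision issue (the values here are chosen for this example, not taken from the test suite): true division promotes the integer tensors to a floating-point dtype that cannot represent `2**53 + 1` exactly, while floor division stays in integer arithmetic:
```python
import torch

a = torch.tensor([2**53 + 1], dtype=torch.int64)
b = torch.tensor([2], dtype=torch.int64)
g = torch.gcd(a, b)  # gcd is 1 here, since a is odd

exact = a // g * b                       # integer maths: 2**54 + 2
via_float = (a / g * b).to(torch.int64)  # float round-trip loses precision
print(exact.item(), via_float.item())    # the two results differ
```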
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82950
Approved by: https://github.com/ngimel
### Description
This enables some unit tests related to BatchNorm for ROCm. We make sure that we call the MIOpen library in cases where it can handle them, and use the default path in other cases. When MIOpen implements this specific case, we will file a follow-up PR enabling that code path.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/82512
Approved by: https://github.com/jeffdaily, https://github.com/albanD
On my machine `ScanKernels.cu` takes 10 minutes for just a single
architecture, which is by far the highest compile time of any single
file. So this splits it into multiple files, the slowest being
`LogcumsumexpKernel.cu`, which takes 2m 30s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83422
Approved by: https://github.com/ngimel
This PR updates PR [#73040](https://github.com/pytorch/pytorch/pull/73040).
With these changes, PyTorch compiles successfully with ROCm when `NDEBUG` is enabled.
Solution:
For HIP we keep `__device__ __assert_fail()`,
and for host-side compilation we use the `__assert_fail()` from the glibc library.
Tested the code by compiling with the steps below:
```
python3 tools/amd_build/build_amd.py
python3 setup.py develop --cmake-only
cmake -DHIP_HIPCC_FLAGS_RELEASE="-DNDEBUG" build
cmake --build build
```
The UT test_fixed_cuda_assert_async is still skipped due to performance overhead.
cc @jithunnair-amd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81790
Approved by: https://github.com/shintaro-iwasaki, https://github.com/jeffdaily, https://github.com/malfet
The current implementation of the `sum_mean_dim` shape function
takes `dim=[]` and `dim=None` to mean "no reduction". However, in the
ops `torch.sum` and `torch.mean`, both `dim=[]` and `dim=None` are
equivalent to "reduce along all dimensions". This commit fixes the
handling of `dim` in the `sum_mean_dim` shape function.
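A quick check of the op semantics being matched, assuming the behavior described above:
```python
import torch

x = torch.ones(2, 3)
# Both dim=[] and dim=None reduce over all dimensions:
assert torch.sum(x, dim=[]).shape == torch.Size([])
assert torch.sum(x, dim=None).shape == torch.Size([])
assert torch.mean(x, dim=[]).shape == torch.Size([])
```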
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83357
Approved by: https://github.com/Gamrix
We're on our way to deleting ProxyTensor entirely (see https://github.com/pytorch/pytorch/pull/83330 ), but before we can do that, we have to delete ProxySymInt first. Here's the plan.
Changes in torch.fx.experimental.symbolic_shapes
* The general idea is to do mode-based tracing. This means we need a mode that can interpose on all SymInt operations. There are a few ways to do this, but I've done it the easy way: (1) I have a separate mode for SymInt operations specifically called SymDispatchMode, and (2) this mode operates on PySymInt (and not the basic SymInt which is user visible). I elided Int from the name because if we add SymFloats I want to use the same mode to handle those as well, and I used Dispatch rather than Function because this is the "inner" dispatch operating on PySymInt and not SymInt (this is not a perfect analogy, but SymFunctionMode definitely seemed wrong as you still must go through the C++ binding.) The mode is entirely implemented in Python for ease of implementation. We could have implemented this more symmetrically to TorchFunctionMode in C++, but I leave that as later work; this API is unlikely to get used by others (unlike TorchFunctionMode). One downside to not doing the mode in C++ is that we still have to do the hop via a preexisting PySymInt to wrap; this is currently not a big deal as conversion to SymInts only really happens when there is already another SymInt floating around. SymDispatchMode is pared down from TorchDispatchMode; there is no ancestor tracking since I don't expect people to be mixing up SymDispatchModes. (A toy sketch of this interposition pattern follows this list.)
* I made some improvements for tracing. When I invoke the SymDispatchMode handler, I would like constants to show up as constants, so they can be directly inlined into the FX graph (rather than going through a wrapping process first, and then the wrapped SymInt being used in the operation). To do this, I directly track if a PySymInt is a constant at construction time. Only wrapped PySymInts are constants.
* For convenience, PySymInts now support all magic methods that regular SymInts do. This is so that redispatch inside the SymDispatchMode can be written the idiomatic way `func(*args, **kwargs)` where func is an operator. The original names are retained for direct C++ calls.
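A toy sketch of the interposition pattern; this is a simplified illustration, not the actual torch.fx.experimental.symbolic_shapes implementation, and the module-level `_SYM_DISPATCH_MODE` variable is invented for the example:
```python
import operator

_SYM_DISPATCH_MODE = None  # the currently active mode, if any

class SymDispatchMode:
    """Handler consulted by every PySymInt operation while active."""
    def __sym_dispatch__(self, func, args, kwargs):
        raise NotImplementedError

    def __enter__(self):
        global _SYM_DISPATCH_MODE
        self._prev, _SYM_DISPATCH_MODE = _SYM_DISPATCH_MODE, self
        return self

    def __exit__(self, *exc):
        global _SYM_DISPATCH_MODE
        _SYM_DISPATCH_MODE = self._prev

class PySymInt:
    def __init__(self, value, constant=False):
        self.value = value
        self.constant = constant  # constants can be inlined into the graph

    def __add__(self, other):
        # Interpose on the operation when a mode is active, else compute.
        if _SYM_DISPATCH_MODE is not None:
            return _SYM_DISPATCH_MODE.__sym_dispatch__(
                operator.add, (self, other), {})
        other_val = other.value if isinstance(other, PySymInt) else other
        return PySymInt(self.value + other_val)
```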
Changes in torch.fx.experimental.proxy_tensor
* OK, so we got a new SymDispatchMode, so we define a ProxySymDispatchMode and activate it when we start tracing. This mode is currently unconditionally activated although technically we only need to activate it when doing symbolic tracing (it doesn't matter either way as there are no SymInts if you are not doing symbolic tracing).
* We delete ProxySymInt. To do this, we must now record the proxy for the SymInt some other way. Based on discussion with Chillee, it is more intuitive to him if the proxies are still recorded on the SymInt in some way. So we store them in the `__dict__` of the PySymInt, indexed by Tracer. An improvement is to make this a weak map, so that we remove all of these entries when the tracer dies. In an original version of this PR, I keyed on the mode itself, but tracer is better as it is accessible from both modes (and as you will see, we will need to fetch the map from both the ProxySymDispatchMode as well as the ProxyTorchDispatchMode.) The implementation of SymDispatchMode now simply retrieves the proxies, performs the underlying operation as well as the FX graph recording, and then records the output proxy to the PySymInt. Note that FX tracing does not work with proxies and SymInts, so we manually call `call_function` to ensure that the correct operations get recorded to the graph. This means conventional FX retracing with proxies only will not work with these graphs, but there wasn't really any reason to do this (as opposed to `make_fx` retracing) anyway. Constants are detected and converted directly into Python integers.
* SymInts can show up as arguments to tensor operations, so they must be accounted for in ProxyTorchDispatchMode as well. This is done by searching for SymInt arguments and converting them into proxies before the proxy call. This can be done more efficiently in a single `tree_map` but I'm lazy. The helper `unwrap_symint_proxy` conveniently implements the unwrapping in one place given a tracer; unfortunately it cannot be shared with SymDispatchMode as SymDispatchMode gets PySymInts, but ProxyTensorMode gets SymInts. Similarly, tensors that are returned from tensor operations can have SymInts in their shapes, which need fresh proxies allocated. To avoid leaking internal details of SymInt shape computation to the tensor operation graph, these SymInts are always given proxies derived from `x.size(dim)` call on their return tensor. We also need to do this for strides and numel but have not done so yet. Furthermore, we must avoid tracing internal SymInt calls while we run meta operations on the true operation; this is achieved by also disabling SymInt tracing on the inside of tensor tracing. This is analogous to how tensor tracing is disabled inside the implementation of tracing mode, but unfortunately we are unable to use the same mechanism (this would have been easier if the two modes could be combined somehow, and I am amenable to suggestions to try harder to achieve this.)
* Because there are no more ProxySymInts, we no longer need to do anything to unwrap SymInt. Furthermore, we do not need to reallocate ProxySymInts on class creation.
* If a bare SymInt without a Proxy is encountered, it is assumed that this must be a constant. `create_arg` handles this case. Non-constant free SymInts result in an assert error.
* The initial input handling in `dispatch_trace` involves traversing all of the input tensors, walking over their shapes, and assigning proxies for the SymInts in the shapes in the same way we handle proxies for the output tensors.
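Continuing the toy sketch from above, a rough illustration of this bookkeeping; the helper names and the `tracer` attribute are invented for the example, with `create_proxy("call_function", ...)` standing in for the manual graph recording described here:
```python
def set_proxy(sym, tracer, proxy):
    # Proxies are stashed on the PySymInt's __dict__, keyed by tracer, so
    # both ProxySymDispatchMode and ProxyTorchDispatchMode can fetch them.
    sym.__dict__.setdefault("proxies", {})[tracer] = proxy

def get_proxy(sym, tracer):
    return sym.__dict__["proxies"][tracer]

class ProxySymDispatchMode(SymDispatchMode):
    def __init__(self, tracer):
        self.tracer = tracer  # an FX tracer with a graph to record into

    def __sym_dispatch__(self, func, args, kwargs):
        # Constants are inlined directly; other PySymInts contribute their
        # recorded proxies.
        proxy_args = tuple(
            a.value if isinstance(a, PySymInt) and a.constant
            else get_proxy(a, self.tracer) if isinstance(a, PySymInt)
            else a
            for a in args
        )
        # FX tracing does not understand SymInts, so record the op manually.
        out_proxy = self.tracer.create_proxy(
            "call_function", func, proxy_args, {})
        # Perform the underlying integer operation, then remember the proxy
        # on the freshly created PySymInt.
        out = func(*(a.value if isinstance(a, PySymInt) else a for a in args))
        result = PySymInt(out)
        set_proxy(result, self.tracer, out_proxy)
        return result
```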
The preexisting testing is inadequate but will be better after I rebase past https://github.com/pytorch/pytorch/pull/82209
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83380
Approved by: https://github.com/samdow
I noticed I was missing tensor creations with modes when I tried
to delete proxy tensor. This was the cause.
Hypothetically, all PyInterpreter calls could get this treatment.
But I think it only matters for detach; the rest do not return
Tensors and most modes will not be interested in them.
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/83372
Approved by: https://github.com/zou3519
Add `TensorImpl::sym_strides`, bind it to Python with `torch.ops.aten.sym_strides`, and use it in `ProxyTensor` and `FakeTensor`.
Before, `ProxyTensor` was generating `ProxySymInt`s for the sizes, but not for the strides. Internally we still represent strides with a `SymIntArrayRef`, though, so I ran into some weird issues where sizes were showing up as `ProxySymInt` but strides were `PySymInt`s.
Differential Revision: [D38594558](https://our.internmc.facebook.com/intern/diff/D38594558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/81300
Approved by: https://github.com/ezyang