Summary:
I am adding documentation for building the C++-only libtorch.so without invoking Python in the build and install process. This works on my Ubuntu 20.04 system and is designed to be operating system agnostic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44196
Reviewed By: zou3519
Differential Revision: D24421066
Pulled By: malfet
fbshipit-source-id: e77c222703353ff7f7383fb88f7bce705f88b7bf
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46244
- What does the generated binding code do?
The Python binding codegen produces code that takes the input list of
PyObjects, finds the matching ATen C++ function using PythonArgParser,
converts the PyObjects into C++ types and calls the ATen C++ function:
```
+--------+  parsing   +------------------------+  binding   +-----------------------+
| PyObjs | ---------> | PythonArgParser Output | ---------> | Cpp Function Dispatch |
+--------+            +------------------------+            +-----------------------+
```
- Are Python arguments 1-1 mapped to C++ arguments?
Python arguments might be reordered, packed, unpacked when binding to
C++ arguments, as illustrated below:
```
// Binding - Reorder & Packing
// aten::empty.names(int[] size, *, Dimname[]? names, ScalarType? dtype=None, Layout? layout=None,
//                   Device? device=None, bool? pin_memory=None, MemoryFormat? memory_format=None) -> Tensor

            Python Args               Cpp Args
-----------------------------------------------------------
         0: size                      size
         1: names                     names
         2: memory_format -------+
         3: dtype         -----+-|--> options
         4: layout            /  |
         5: device           /   +--> memory_format
         6: pin_memory      /
         7: requires_grad -+

// Binding - Unpacking
// aten::max.names_dim(Tensor self, Dimname dim, bool keepdim=False) -> (Tensor values, Tensor indices)

            Python Args               Cpp Args
-----------------------------------------------------------
                               +----> max
                              /-----> max_values
         0: input            /        self
         1: dim             /         dim
         2: keepdim        /          keepdim
         3: out      -----+
```
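The reorder-and-pack case above can be sketched in plain Python. This is only an illustration of the mapping in the diagram, not the generated binding code: `PYTHON_ARG_ORDER`, `PACKED_INTO_OPTIONS` and `bind_empty_names` are hypothetical names, and the C++ `options` argument is modelled as a dict.

```python
from typing import Any, Dict, List

# Hypothetical stand-in for the parser output of `empty.names`,
# in the Python-side order shown in the diagram.
PYTHON_ARG_ORDER: List[str] = [
    "size", "names", "memory_format", "dtype",
    "layout", "device", "pin_memory", "requires_grad",
]

# Python arguments that are packed into the single C++ `options` argument.
PACKED_INTO_OPTIONS = ("dtype", "layout", "device", "pin_memory", "requires_grad")


def bind_empty_names(parsed: List[Any]) -> Dict[str, Any]:
    """Reorder and pack positional parser output into C++-side arguments."""
    by_name = dict(zip(PYTHON_ARG_ORDER, parsed))
    options = {name: by_name[name] for name in PACKED_INTO_OPTIONS}
    # C++ order: size, names, options, memory_format.
    return {
        "size": by_name["size"],
        "names": by_name["names"],
        "options": options,
        "memory_format": by_name["memory_format"],
    }


cpp_args = bind_empty_names(
    [[2, 3], ["N", "C"], None, "float32", None, "cpu", False, True]
)
print(list(cpp_args))  # C++ argument names, in C++ order
```

Note that `memory_format` is third on the Python side but last on the C++ side, which is the reordering the diagram shows.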
- Why do we want to rewrite the python binding codegen?
The old codegen takes Declarations.yaml as input. It doesn't distinguish
between Python arguments and C++ arguments: they are all mixed together
as a bag of untyped dict objects. Different methods process these arg
objects and add new attributes for different purposes, so the semantics
of those attributes are hard to work out. The complicated binding logic
is implicit and scattered across the code.
```
+--------------------+
|  Native Functions  |
+--------------------+
  |
  |
  v
+--------------------+
|   Cpp Signatures   |
+--------------------+
  |
  |
  v
+--------------------+
| Declarations.yaml  |
+--------------------+
  |                        +-------------------------------------+
  |              +-------> |       PythonArgParser Schema        |
  |              |         +-------------------------------------+
  |              |                            .
  |              |                            .
  v              |                            .
+--------------------+     +-------------------------------------+
| NonTyped Args Objs | --> | PythonArgParser -> Cpp Args Binding |
+--------------------+     +-------------------------------------+
                 |                            .
                 |                            .
                 |                            .
                 |         +-------------------------------------+
                 +-------> |        Cpp Function Dispatch        |
                           +-------------------------------------+
```
This PR leverages the new immutable data models introduced in the new
aten codegen and introduces dedicated data models for the Python schema.
This way, we not only avoid subtle Declarations.yaml conversions but
also decouple the generation of the Python schema, the Python-to-C++
binding and the C++ function call.
The end state will look like the following diagram:
```
                    +-------------------+     +-------------------------------------+
          +-------> | Python Signatures | --> |       PythonArgParser Schema        |
          |         +-------------------+     +-------------------------------------+
          |                   |                                  .
          |                   |                                  .
          |                   |                                  .
+------------------+          |         +-------------------------------------+
| Native Functions |          +-------> | PythonArgParser -> Cpp Args Binding |
+------------------+          |         +-------------------------------------+
          |                   |                                  .
          |                   |                                  .
          |                   |                                  .
          |         +-------------------+     +-------------------------------------+
          +-------> |  Cpp Signatures   | --> |        Cpp Function Dispatch        |
                    +-------------------+     +-------------------------------------+
```
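As a rough sketch of what a dedicated, immutable data model buys over a bag of untyped dicts, consider the frozen dataclasses below. `PyArg` and `PySignature` are hypothetical names chosen for illustration and are not the classes added in tools/codegen/api/python.py.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PyArg:  # hypothetical name, not the PR's actual class
    name: str
    type: str
    default: Optional[str] = None


@dataclass(frozen=True)
class PySignature:  # hypothetical name
    name: str
    args: Tuple[PyArg, ...]

    def schema(self) -> str:
        """Render a parser-schema-like string from the typed arguments."""
        parts = [
            f"{a.type} {a.name}" + (f"={a.default}" if a.default is not None else "")
            for a in self.args
        ]
        return f"{self.name}({', '.join(parts)})"


sig = PySignature(
    name="max",
    args=(
        PyArg("input", "Tensor"),
        PyArg("dim", "Dimname"),
        PyArg("keepdim", "bool", "False"),
    ),
)
print(sig.schema())

try:
    sig.name = "min"  # later passes cannot silently add or change attributes
except FrozenInstanceError:
    print("signatures are immutable")
```

Because every field is declared and frozen, each generation step (schema, binding, dispatch) can be derived from the same object without one step mutating what another step reads.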
This PR migrates the core binding logic from
tools/autograd/gen_python_functions.py to tools/codegen/api/python.py.
It produces byte-for-byte identical results (tested with #46243).
The rest of gen_python_functions.py will be migrated in subsequent PRs.
Test Plan: Imported from OSS
Reviewed By: bhosmer
Differential Revision: D24388874
Pulled By: ljk53
fbshipit-source-id: f88b6df4e917cf90d868a2bbae2d5ffb680d1841
Summary:
Retake on https://github.com/pytorch/pytorch/issues/40493 after all the feedback from albanD
This PR implements the generic Lazy mechanism and a sample `LazyLinear` layer with the `UninitializedParameter`.
There are two main differences from the previous PR:
- `torch.nn.Module` now remains untouched.
- We no longer require an explicit initialization or a dummy forward pass before starting training or inference with the actual module, which makes this much simpler to use.
As we discussed offline, there was a suggestion to avoid a mixin and instead change the `__class__` attribute of `LazyLinear` to `Linear` once it is completely initialized. While this can be useful, for the time being we need `LazyLinear` to be a `torch.nn.Module` subclass, since many checks rely on modules being instances of `torch.nn.Module`.
This can cause problems when we create complex modules such as:
```
import torch

class MyNetwork(torch.nn.Module):
    def __init__(self):
        super(MyNetwork, self).__init__()
        self.conv = torch.nn.Conv2d(20, 4, 2)
        self.linear = torch.nn.LazyLinear(10)

    def forward(self, x):
        y = self.conv(x).clamp(min=0)
        return self.linear(y)
```
Here, when the `__setattr__` function is called at the time `LazyLinear` is registered, it won't be added to the child modules of `MyNetwork`, so we have to add it manually later. Currently there is no way to do that, because we can't access the parent module from `LazyLinear` once it becomes the `Linear` module. (We can add a workaround if needed.)
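The two ideas discussed above, deferring parameter creation until the first input is seen and then switching `__class__` to the regular layer, can be illustrated without PyTorch. This is a toy sketch only: `ToyLinear` and `ToyLazyLinear` are hypothetical classes that use nested lists instead of tensors, and they do not reflect how this PR implements `LazyLinear`.

```python
import random


class ToyLinear:
    """Hypothetical fully initialized layer with a dense weight matrix."""

    def __init__(self, in_features, out_features):
        self.out_features = out_features
        self._materialize(in_features)

    def _materialize(self, in_features):
        self.in_features = in_features
        self.weight = [
            [random.uniform(-1.0, 1.0) for _ in range(in_features)]
            for _ in range(self.out_features)
        ]

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weight]


class ToyLazyLinear(ToyLinear):
    """Defers weight creation until the first input reveals in_features."""

    def __init__(self, out_features):
        self.out_features = out_features
        self.weight = None  # stands in for an uninitialized parameter

    def __call__(self, x):
        self._materialize(len(x))
        # Swap the class so later calls take the regular code path.
        self.__class__ = ToyLinear
        return self(x)


layer = ToyLazyLinear(out_features=3)
output = layer([0.5, -1.0, 2.0, 4.0])
print(type(layer).__name__, layer.in_features, len(output))
```

The object identity of `layer` is unchanged by the swap, so any container that already holds a reference to it keeps working.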
TODO:
Add convolutions once the design is OK
Fix docstrings
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44538
Reviewed By: ngimel
Differential Revision: D24162854
Pulled By: albanD
fbshipit-source-id: 6d58dfe5d43bfb05b6ee506e266db3cf4b885f0c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45164
This PR implements `fft2`, `ifft2`, `rfft2` and `irfft2`. These are the last functions required for `torch.fft` to match `numpy.fft`. If you look at either NumPy or SciPy you'll see that the 2-dimensional variants are identical to `*fftn` in every way, except for the default value of `axes`. In fact you can even use `fft2` to do general n-dimensional transforms.
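The relationship between the 2-dimensional and n-dimensional variants can be sketched in pure Python. A naive DFT stands in for the real implementation, and `dft_1d`, `toy_fftn` and `toy_fft2` are hypothetical helpers operating on a flat row-major list, not `torch.fft` functions.

```python
import cmath
from itertools import product


def dft_1d(values):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(values)
    return [
        sum(values[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
        for j in range(n)
    ]


def toy_fftn(flat, shape, axes=None):
    """Transform a row-major flat array over `axes` (default: every axis)."""
    ndim = len(shape)
    axes = range(ndim) if axes is None else [a % ndim for a in axes]
    strides = [1] * ndim
    for d in range(ndim - 2, -1, -1):
        strides[d] = strides[d + 1] * shape[d + 1]
    out = list(flat)
    for axis in axes:
        # Visit every 1-D line that runs along `axis`.
        other_dims = [range(shape[d]) if d != axis else range(1) for d in range(ndim)]
        for index in product(*other_dims):
            base = sum(i * s for i, s in zip(index, strides))
            positions = [base + k * strides[axis] for k in range(shape[axis])]
            transformed = dft_1d([out[p] for p in positions])
            for p, value in zip(positions, transformed):
                out[p] = value
    return out


def toy_fft2(flat, shape, axes=(-2, -1)):
    """Same transform as toy_fftn; only the default `axes` differs."""
    return toy_fftn(flat, shape, axes)


shape = (2, 2, 2)  # a batch of two 2x2 matrices
data = [complex(v) for v in range(8)]
two_d = toy_fft2(data, shape)
full = toy_fft2(data, shape, axes=(0, 1, 2))  # general n-dimensional use
print([round(v.real) for v in two_d[:4]])
```

Passing all three axes to `toy_fft2` gives the same result as `toy_fftn` with its default, which mirrors the remark that the 2-dimensional variant can also do general n-dimensional transforms.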
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D24363639
Pulled By: mruberry
fbshipit-source-id: 95191b51a0f0b8e8e301b2c20672ed4304d02a57
Summary:
The `i` variable on `Line 272` is ambiguous. I think it should be named `epoch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45944
Reviewed By: agolynski
Differential Revision: D24219486
Pulled By: vincentqb
fbshipit-source-id: 2af0408594613e82a1a1b63971650cabde2b576e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46075
Removes these from public docs for now as we are still
iterating/formalizing these APIs. Will add them back once they are part of a
PyTorch release.
ghstack-source-id: 113928700
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D24211510
fbshipit-source-id: 3e36ff6990cf8e6ef72b6e524322ae06f9097aa2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45847
Original PR here https://github.com/pytorch/pytorch/pull/45084. Created this one because I was having problems with ghstack.
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D24136629
Pulled By: heitorschueroff
fbshipit-source-id: dd7c7540a33f6a19e1ad70ba2479d5de44abbdf9
Summary:
Currently, a GraphRoot instance doesn't have an associated stream. Streaming backward synchronization logic assumes the instance ran on the default stream, and tells consumer ops to sync with the default stream. If the gradient the GraphRoot instance passes to consumer backward ops was populated on a non-default stream, we have a race condition.
The race condition can exist even if the user doesn't give a manually populated gradient:
```python
with torch.cuda.stream(side_stream):
    # loss.backward() implicitly synthesizes a one-element 1.0 tensor on side_stream
    # GraphRoot passes it to consumers, but consumers first sync on default stream, not side_stream.
    loss.backward()

    # Internally to backward(), streaming-backward logic takes over, stuff executes on the same stream it ran on in forward,
    # and the side_stream context is irrelevant. GraphRoot's interaction with its first consumer(s) is the spot where
    # the side_stream context causes a problem.
```
This PR fixes the race condition by associating a GraphRoot instance, at construction time, with the current stream(s) on the device(s) of the grads it will pass to consumers. (I think this relies on GraphRoot executing in the main thread, before backward thread(s) fork, because the grads were populated on the main thread.)
The test demonstrates the race condition. It fails reliably without the PR's GraphRoot diffs and passes with the GraphRoot diffs.
With the GraphRoot diffs, manually populating an incoming-gradient arg for `backward` (or `torch.autograd.grad`) and the actual call to `autograd.backward` will have the same stream-semantics relationship as any other pair of ops:
```python
# implicit population is safe
with torch.cuda.stream(side_stream):
    loss.backward()

# explicit population in side stream then backward in side stream is safe
with torch.cuda.stream(side_stream):
    kickoff_grad = torch.ones_like(loss)
    loss.backward(gradient=kickoff_grad)

# explicit population in one stream then backward kickoff in another stream
# is NOT safe, even with this PR's diffs, but that unsafety is consistent with
# stream-semantics relationship of any pair of ops
kickoff_grad = torch.ones_like(loss)
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)

# Safe, as you'd expect for any pair of ops
kickoff_grad = torch.ones_like(loss)
side_stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side_stream):
    loss.backward(gradient=kickoff_grad)
```
This PR also adds the last three examples above to cuda docs and references them from autograd docstrings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45787
Reviewed By: nairbv
Differential Revision: D24138376
Pulled By: albanD
fbshipit-source-id: bc4cd9390f9f0358633db530b1b09f9c1080d2a3
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45543
This PR adds documentation for the c10d Store to the public docs. Previously these docs were missing although we exposed a lightly-used (but potentially useful) Python API for our distributed key-value store.
ghstack-source-id: 113409195
Test Plan: Will verify screenshots by building the docs.
Reviewed By: pritamdamania87
Differential Revision: D24005598
fbshipit-source-id: 45c3600e7c3f220710e99a0483a9ce921d75d044
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45232
**Summary**
This commit updates the TorchScript language reference to include
documentation on recently-added TorchScript enums. It also removes
`torch.no_grad` from the list of known unsupported `torch` modules and
classes because it is now supported.
**Test Plan**
Continuous integration.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D23971884
Pulled By: SplitInfinity
fbshipit-source-id: 5e2c164ed59bc0926b11201106952cff86e9356e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45294
While tracking down a recent memory corruption bug we found that
cuda-memcheck wasn't finding the bad accesses, and ngimel pointed out that
it's because we use a caching allocator so a lot of "out of bounds" accesses
land in a valid slab.
This PR adds a runtime knob (`PYTORCH_NO_CUDA_MEMORY_CACHING`) that, when set,
bypasses the caching allocator's caching logic so that allocations go straight
to cudaMalloc. This way, cuda-memcheck will actually work.
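Why caching hides out-of-bounds accesses can be shown with a toy model. The sketch below assumes a bounds checker that only knows about allocations made directly from the driver. `ToyCachingAllocator` is a hypothetical class for illustration and does not reflect the real caching allocator's data structures.

```python
class ToyCachingAllocator:
    """Hypothetical allocator that carves requests out of one big slab."""

    def __init__(self, slab_size, caching=True):
        self.caching = caching
        self.next_offset = 0
        self.valid_ranges = []  # what a memcheck-style tool can see
        if caching:
            # The tool only sees one large allocation from the driver.
            self.valid_ranges.append((0, slab_size))

    def malloc(self, size):
        start = self.next_offset
        self.next_offset += size
        if not self.caching:
            # Each request is its own driver allocation with its own bounds.
            self.valid_ranges.append((start, start + size))
        return start

    def access_is_flagged(self, address):
        """Return True if the bounds checker would report this access."""
        return not any(lo <= address < hi for lo, hi in self.valid_ranges)


def out_of_bounds_is_flagged(caching):
    allocator = ToyCachingAllocator(slab_size=1024, caching=caching)
    buf = allocator.malloc(16)
    return allocator.access_is_flagged(buf + 16)  # one element past the end


print(out_of_bounds_is_flagged(caching=True))
print(out_of_bounds_is_flagged(caching=False))
```

With caching, the stray access still falls inside the slab and goes unreported. Without caching, it falls outside every known allocation and is flagged.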
Test Plan:
Insert some memory errors and run a test under cuda-memcheck;
observe that cuda-memcheck flags an error where expected.
Specifically I removed the output-masking logic here:
https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L819-L826
And ran:
```
PYTORCH_NO_CUDA_MEMORY_CACHING=1 cuda-memcheck pytest -k test_superslomo test_jit_fuser_te.py
```
Reviewed By: ngimel
Differential Revision: D23964734
Pulled By: bertmaher
fbshipit-source-id: 04efd11e8aff037b9edde80c70585cb820ee6e39
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45306
Adds details to the main quantization doc on how specifically
users can skip or customize quantization of layers.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23917034
Pulled By: vkuzo
fbshipit-source-id: ccf71ce4300c1946b2ab63d1f35a07691fd7a2af
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45305
Adds an explanation for reduce_range to the main quantization
doc page.
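As a rough illustration of what reduce_range means numerically, the sketch below assumes that it narrows the quantized integer range by one bit and uses a standard affine mapping. `quant_range` and `affine_qparams` are hypothetical helpers, not PyTorch APIs, and the doc page remains the authoritative description.

```python
def quant_range(bits=8, signed=False, reduce_range=False):
    """Integer range of a quantized dtype, optionally narrowed by one bit."""
    if reduce_range:
        bits -= 1
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def affine_qparams(min_val, max_val, qmin, qmax):
    """Scale and zero point mapping [min_val, max_val] onto [qmin, qmax]."""
    # The represented float range must include 0.
    min_val, max_val = min(min_val, 0.0), max(max_val, 0.0)
    scale = (max_val - min_val) / (qmax - qmin)
    zero_point = qmin - round(min_val / scale)
    return scale, zero_point


for reduce_range in (False, True):
    qmin, qmax = quant_range(reduce_range=reduce_range)
    scale, zero_point = affine_qparams(-1.0, 3.0, qmin, qmax)
    print(reduce_range, (qmin, qmax), scale, zero_point)
```

The narrower range roughly doubles the scale, so each quantized step covers a larger slice of the float range.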
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23916669
Pulled By: vkuzo
fbshipit-source-id: ef93fb774cb15741cd92889f114f6ab76c39f051
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45135
The previous quantization summary had steps on what to do for
dynamic, static, QAT. This PR moves these steps to comments in the
example code, so it is clearer how to accomplish the steps.
Test Plan: Imported from OSS
Reviewed By: jerryzh168
Differential Revision: D23842456
Pulled By: vkuzo
fbshipit-source-id: db2399e51e9ae33c8a1ac610e3d7dbdb648742b0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45093
This adds a tl;dr-style summary of the quantization API
to the documentation. Hopefully this will make it easier
for new folks to learn how to use quantization.
This is not meant to be all-encompassing. Future PRs
can improve the documentation further.
Test Plan:
1. build the doc as specified in https://github.com/pytorch/pytorch#building-the-documentation
2. inspect the quantization page in Chrome, format looks good
Reviewed By: jerryzh168
Differential Revision: D23828257
Pulled By: vkuzo
fbshipit-source-id: 9311ee3f394cd83af0aeafb6e2fcdc3e0321fa38
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45356
In this PR, I'm adding a warning to the PG backend mentioning that it will
be deprecated in the future. In addition, I removed the warning from the
TP backend saying that it is a beta feature.
ghstack-source-id: 112940501
Test Plan: waitforbuildbot
Reviewed By: mrshenli
Differential Revision: D23940144
fbshipit-source-id: d44054aa1e4ef61004a40bbe0ec45ff07829aad4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45188
This is a symbolically traceable alternative to Python's `assert`.
It should be useful to allow people who want to use FX to also
be able to assert things.
A bunch of TODO(before land) items are inline. I would love thoughts
on where the best place for this code to live is, and what this
function should be called (since `assert` is reserved).
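Why a function-call form helps can be sketched with a toy tracer. The sketch assumes that a traced value cannot be converted to `bool`, which is what a plain `assert` statement needs. `Proxy`, `Graph` and `traced_assert` are hypothetical names for illustration, not the FX classes or the function added in this PR.

```python
class Proxy:
    """Hypothetical traced value; records operations instead of running them."""

    def __init__(self, name, graph):
        self.name = name
        self.graph = graph

    def __gt__(self, other):
        return self.graph.record("gt", self, other)

    def __bool__(self):
        # A plain `assert proxy > 0` ends up here: there is no concrete value.
        raise TypeError("cannot convert a traced value to bool")


class Graph:
    def __init__(self):
        self.nodes = []

    def record(self, op, *args):
        out = Proxy(f"v{len(self.nodes)}", self)
        shown = [a.name if isinstance(a, Proxy) else repr(a) for a in args]
        self.nodes.append(f"{out.name} = {op}({', '.join(shown)})")
        return out


def traced_assert(condition, message):
    """Function-call form of assert: recorded when tracing, checked otherwise."""
    if isinstance(condition, Proxy):
        condition.graph.record("assert", condition, message)
    elif not condition:
        raise AssertionError(message)


graph = Graph()
x = Proxy("x", graph)
traced_assert(x > 0, "x must be positive")
print(graph.nodes)
```

Because the check is an ordinary call, it becomes a node in the recorded graph and can be replayed later with real values.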
Test Plan:
```
python test/test_fx.py TestFX.test_symbolic_trace_assert
```
Imported from OSS
Reviewed By: jamesr66a
Differential Revision: D23861567
fbshipit-source-id: d9d6b9556140faccc0290eba1fabea401d7850de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44550
Part of the `torch.fft` work (gh-42175).
This adds n-dimensional transforms: `fftn`, `ifftn`, `rfftn` and `irfftn`.
This is aiming for correctness first, with the implementation on top of the existing `_fft_with_size` restrictions. I plan to follow up later with a more efficient rewrite that makes `_fft_with_size` work with arbitrary numbers of dimensions.
Test Plan: Imported from OSS
Reviewed By: ngimel
Differential Revision: D23846032
Pulled By: mruberry
fbshipit-source-id: e6950aa8be438ec5cb95fb10bd7b8bc9ffb7d824
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45235
This is so that users know that the profiler works as expected with
RPC and they can learn how to use it to profile RPC-based workloads.
ghstack-source-id: 112773748
Test Plan: CI
Reviewed By: mrshenli
Differential Revision: D23777888
fbshipit-source-id: 4805be9b949c8c7929182f291a6524c3c6a725c1