Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10651
EnsureCPUOutputOp will copy the input from another Context to CPU, but currently there is no guarantee that the Copy will be executed.
Differential Revision: D9390046
fbshipit-source-id: af3ff19cf46560264cb77d2ab8821f0cc5be74f6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10551
Renaming "subtree" -> "subgraph" to improve the clarity of the subgraph matcher APIs, since the matcher now supports DAGs.
This is a pure rename; no functionality changes.
Reviewed By: bwasti
Differential Revision: D9348311
fbshipit-source-id: 4b9267845950f3029dfe385ce3257d3abb8bdad4
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10549
Support DAG matching in nomnigraph. This is done by maintaining a map from nodes in the MatchGraph to nodes in the input graph, and additionally enforcing that the same node in the MatchGraph must always match the same node in the input graph (with the exception of multiplicity, i.e. when count != 1 on the MatchGraph node).
In a follow-up diff, I'll rename the API's references to "subtree" to "subgraph" to improve clarity.
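The correspondence-map invariant described above can be sketched in pure Python (hypothetical names; the actual matcher is C++ in nomnigraph):

```python
def try_bind(mapping, match_node, input_node):
    """Bind match_node -> input_node in the correspondence map.

    Succeeds if match_node is unbound or already bound to the same
    input-graph node; a conflicting re-bind is rejected, which is what
    lets the matcher handle DAGs (shared nodes) correctly.
    """
    bound = mapping.get(match_node)
    if bound is None:
        mapping[match_node] = input_node
        return True
    return bound == input_node

mapping = {}
print(try_bind(mapping, "m1", "a"))  # first binding succeeds: True
print(try_bind(mapping, "m1", "a"))  # consistent re-bind is fine: True
print(try_bind(mapping, "m1", "b"))  # conflicting bind is rejected: False
```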
Reviewed By: bwasti
Differential Revision: D9347322
fbshipit-source-id: 171491b98c76852240a253279c2654e96dd12632
Summary:
Some more `ATEN_API` additions for hidden visibility.
Running CI tests to see what fails to link.
cc Yangqing mingzhe09088 ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10624
Reviewed By: mingzhe09088
Differential Revision: D9392728
Pulled By: orionr
fbshipit-source-id: e0f0861496b12c9a4e40c10b6e0c9e0df18e8726
Summary:
Minor fix for the cuDNN cache. Previously, when an RNN function was called on GPU 0 and then called on GPU 1 in eval mode, we would skip re-initializing the event for GPU 1, causing an "incorrect resource handle" error when trying to record the event.
soumith
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10662
Reviewed By: soumith
Differential Revision: D9393629
Pulled By: apaszke
fbshipit-source-id: e64c1c1d2860e80f5a7ba727d0b01aeb5f762d90
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/9888
Limiter cannot be shared or copied; just pass it to the first reader.
Reviewed By: xianjiec
Differential Revision: D9008871
fbshipit-source-id: e20cd785b26b1844e156efc3833ca77cfc3ffe82
Summary:
Trigonometry functions were recently added to ONNX in https://github.com/onnx/onnx/pull/869
This PR makes PyTorch support exporting graphs with trigonometry functions.
This PR might need to wait until it is ready to change
```python
_onnx_opset_version = 6
```
to
```python
_onnx_opset_version = 7
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/7540
Differential Revision: D9395041
Pulled By: bddppq
fbshipit-source-id: bdf3e9d212b911c8c4eacf5a0753bb092e4748d2
Summary:
There is no reason a user should need an extra import to use DistributedSampler.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10671
Differential Revision: D9395189
Pulled By: SsnL
fbshipit-source-id: 8f41d93813c8fb52fe012f76980c6a261a8db9b2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10478
- Removed Backend constructor from Device, and fixed all
use-sites to use DeviceType::CPU instead of kCPU, or
use a new function backendToDeviceType to perform
the conversion.
- New method device_type() on Type; it gives you the
underlying device type, e.g., CPU for SparseCPU.
- We add backward compatibility for kCPU/kCUDA uses,
by introducing a new special type which is implicitly
convertible to both DeviceType and Backend. As long as
you don't define a function that's overloaded on both
DeviceType and Backend (but not on BackendOrDeviceType),
the implicit conversions will ensure that uses
of at::Device(at::kCPU) keep working. We fixed use-sites in
the library, but did NOT fix sites in the test code, so that
we can exercise this BC code.
Reviewed By: Yangqing
Differential Revision: D9301861
fbshipit-source-id: 9a9d88620500715c7b37e655b4fd761f6dd72716
Summary:
... to avoid slow at::chunk (it is slow due to tensor initialization). Picking up from #10026
This is done through the following:
1) Absorb starting chunks into FusionGroup as a part of the graph fuser
pass.
2) When compiling a kernel, emit a `std::vector<ConcatDesc>` that describes if an input (of the original graph) will be chunked.
3) When launching a kernel, use the `std::vector<ConcatDesc>` to chunk an
input tensor on the CPU. This chunk directly takes an at::Tensor and creates
four TensorInfo structs in-place in the argument list, bypassing the creation of intermediate Tensors.
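The key trick in steps 2)-3) is describing each chunk as an (offset, shape, strides) view rather than materializing intermediate tensors. A pure-Python sketch (hypothetical helper; the real ConcatDesc/TensorInfo structs are C++):

```python
def chunk_views(shape, strides, nchunks, dim, base_offset=0):
    """Describe `nchunks` equal views of a tensor along `dim` without
    allocating intermediate tensors; each view is (offset, shape, strides),
    with the offset measured in elements from the base data pointer."""
    assert shape[dim] % nchunks == 0, "sketch assumes an even split"
    split = shape[dim] // nchunks
    views = []
    for i in range(nchunks):
        new_shape = list(shape)
        new_shape[dim] = split
        # advancing `split` elements along `dim` moves the data pointer
        # by split * strides[dim] elements; strides are unchanged
        views.append((base_offset + i * split * strides[dim],
                      new_shape, list(strides)))
    return views

# a contiguous (4, 6) tensor chunked into 2 pieces along dim 1
print(chunk_views((4, 6), (6, 1), 2, dim=1))
# [(0, [4, 3], [6, 1]), (3, [4, 3], [6, 1])]
```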
- Expect test and correctness test to see if a single chunk is fused
by the graph fuser
- Correctness test for a variety of chunks (dimension = beginning,
middle, end) and tensors (contiguous, non-contiguous, edge case
(splitSize = 1) for both CPU/CUDA
- Expect test for multiple chunks fused into the same kernel and
correctness test.
cc zdevito apaszke
LSTM forward pass, 1 layer, 512 hidden size and input size, 100 seq length, requires_grad=False on all inputs and weights.
After changes:
```
thnn cudnn jit
8.8468 6.5797 9.3470
```
Before changes:
```
thnn cudnn jit
9.9221 6.6539 11.2550
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10178
Differential Revision: D9382661
Pulled By: zou3519
fbshipit-source-id: 1f8a749208fbdd45559775ce98cf4eb9558448f8
Summary:
Take 2 of #10543
The problem was that between the commit and the merge one more entry point was added: `tools/build_libtorch.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10659
Differential Revision: D9393540
Pulled By: soumith
fbshipit-source-id: 8ebfed600fc735fd1cb0489b161ec80e3db062e0
Summary:
Fixes #10096
If the only thing preventing a simple mappable operator from being fused
into a fusion group is that its Tensor inputs are not of the same shape as the
output, then the graph fuser inserts explicit expand nodes for those
inputs.
This helps the graph fuser not miss out on any fusion opportunities
involving simple mappable operations that have Tensor inputs. This PR
doesn't do anything for the scalar case; that can be addressed later.
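The decision above — insert an explicit expand only when a Tensor input broadcasts to, but does not equal, the output shape — can be sketched in pure Python (hypothetical helpers, not the actual graph fuser code):

```python
def broadcastable_to(in_shape, out_shape):
    """True if in_shape broadcasts to out_shape under NumPy/PyTorch rules:
    trailing dimensions must match or be 1 on the input side."""
    for i, o in zip(reversed(in_shape), reversed(out_shape)):
        if i != o and i != 1:
            return False
    return len(in_shape) <= len(out_shape)

def needs_expand(in_shape, out_shape):
    """A fusable input gets an explicit expand node iff its shape
    broadcasts to, but does not already equal, the output shape."""
    return tuple(in_shape) != tuple(out_shape) and \
        broadcastable_to(in_shape, out_shape)

print(needs_expand((1, 4), (3, 4)))  # broadcast along dim 0 -> True
print(needs_expand((3, 4), (3, 4)))  # already the output shape -> False
print(needs_expand((2, 4), (3, 4)))  # not broadcastable at all -> False
```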
Test Plan
- Simple expect test case
- Added expect tests for a raw LSTMCell. The expands help speed up the
forwards pass by allowing more operations to be fused into the LSTMCell's single
FusionGroup.
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10325
Differential Revision: D9379308
Pulled By: zou3519
fbshipit-source-id: 86d2202eb97e9bb16e511667b7fe177aeaf88245
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10630
`onnxTensorDescriptorV1.name` points into a string buffer. We use a vector of strings as the storage, which means we cannot let the vector reallocate, because that may invalidate the `onnxTensorDescriptorV1.name` pointers. The solution is to reserve a large enough vector up front so that it won't reallocate.
Reviewed By: bddppq, houseroad
Differential Revision: D9381838
fbshipit-source-id: f49c5719aafcc0829c79f95a2a39a175bcad7bfe
Summary:
This is on the way to resolving #9940.
Fixes #10501
This PR modifies graph fuser to fuse operations that have constant
scalar arguments. These constant scalar arguments are directly inlined
into the kernel body.
The context for this is that LSTM backward (in particular, sigmoid
backward) has many add(x, 1.) operations. This PR should be sufficient for
LSTM backward to get fused by the graph fuser.
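Inlining a constant scalar directly into the kernel body amounts to emitting it as a literal in the generated source instead of passing it as an argument. A minimal codegen sketch (hypothetical function; not the actual JIT fusion codegen):

```python
def emit_add_kernel(scalar):
    """Emit a pointwise-kernel body with the constant scalar inlined as a
    literal, so the fused kernel needs no extra scalar parameter."""
    return f"out[i] = x[i] + {float(scalar)!r};"

# sigmoid backward is full of add(x, 1.) nodes; with inlining each one
# compiles straight into the fused kernel body:
print(emit_add_kernel(1.0))  # out[i] = x[i] + 1.0;
```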
cc apaszke zdevito
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10511
Differential Revision: D9378896
Pulled By: zou3519
fbshipit-source-id: 6a7a2987f5b6e8edaaf4b599cd200df33361650f
Summary:
This is still not the final PR, but it removes all blockers for actually using the RNN functions directly in the JIT. Next patch should be final, and will actually remove the symbolic_override code, and change it to proper symbolics for those ATen functions. Turns out the symbolic code can be also cleaned up a bit, and I'll do that too.
zdevito ezyang
colesbury (for minor DispatchStub.h) changes
There was no way to handle those in the JIT for now, and they turned
out to be completely unnecessary. It should make the Python and C++
module code much simpler too, since all the logic is now centralized
in the native functions.
The downside is that RNN modules no longer own their dropout buffers,
which are shared per-device instead (with appropriate locking and
synchronization). This might appear as a perf regression at first, but
in reality it's highly unlikely that anyone will want to run cuDNN RNNs
on the same GPU in parallel.
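The "shared per-device, with locking" pattern can be sketched in pure Python (hypothetical names; the real dropout-state cache lives in the C++ cuDNN RNN bindings):

```python
import threading

_dropout_states = {}            # device index -> shared dropout state
_dropout_states_lock = threading.Lock()

def get_dropout_state(device):
    """Return the per-device dropout buffer, creating it lazily.

    Every RNN on a given device shares the same state object instead of
    owning its own buffer; the lock serializes creation and access."""
    with _dropout_states_lock:
        if device not in _dropout_states:
            _dropout_states[device] = {"device": device, "initialized": False}
        return _dropout_states[device]

a = get_dropout_state(0)
b = get_dropout_state(0)
print(a is b)                        # shared, not per-module: True
print(get_dropout_state(1) is a)     # distinct state per device: False
```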
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10581
Reviewed By: colesbury
Differential Revision: D9365541
Pulled By: apaszke
fbshipit-source-id: 3ef8677ee5481bae60c74a9117a2508665b476b5
Summary:
This PR is the first step in integrating the torch.nn library with the JIT. It adds tests for the nn functional interfaces in trace/script mode and tries to find the differences between the torch.nn.functional ops and the ATen ops, to see what work is needed to support the full set of nn functionals in script mode.
Some statistics in summary:
- 84 useful functions in torch.nn.functional in total (not counting helper and deprecated functions).
- 7 functions/ops do not support higher-order gradients, so they are excluded from the test entirely.
- 36 functions differ from their ATen counterparts for various reasons. Among those 36, roughly 10-15 differ only in naming or are simple transformations using other ops inside the function.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10409
Differential Revision: D9350694
Pulled By: wanchaol
fbshipit-source-id: 8fce6f30d8d25ace5a544a57b219fe61f5a092f8
Summary:
Inline if branches whose conditions are constant. If an if node gets inlined, the set of mutated variables returned by its ancestors may change. In the following example the block should
return a mutated set of (a), not (a, b).
```
if cond:
if True:
a = a - 1
else:
b = b - 1
```
To calculate this, we recursively update the mutated variables in if branches from the leaf nodes up.
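The bottom-up recomputation can be sketched over a tiny nested-if IR (hypothetical representation, not the actual JIT IR):

```python
def mutated_vars(node):
    """Bottom-up mutated-variable set for a toy nested-if IR.

    A node is either ('assign', name) or
    ('if', cond, then_block, else_block) where blocks are node lists.
    An `if` with a constant condition is treated as inlined: only the
    taken branch contributes mutations."""
    if node[0] == "assign":
        return {node[1]}
    _, cond, then_block, else_block = node
    if cond is True:        # constant-true: only the then-branch survives
        blocks = [then_block]
    elif cond is False:     # constant-false: only the else-branch survives
        blocks = [else_block]
    else:                   # non-constant: both branches may run
        blocks = [then_block, else_block]
    out = set()
    for block in blocks:
        for child in block:
            out |= mutated_vars(child)
    return out

# if cond: (if True: a = a - 1 else: b = b - 1)
prog = ("if", "cond",
        [("if", True, [("assign", "a")], [("assign", "b")])],
        [])
print(mutated_vars(prog))  # {'a'}
```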
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10084
Reviewed By: michaelsuo
Differential Revision: D9340429
Pulled By: eellison
fbshipit-source-id: b0dd638a5cace9fdec3130460428fca655ce4b98
Summary:
https://github.com/pytorch/pytorch/pull/10100 recently started taking external inputs/outputs into account in nomnigraph. This PR makes adjustments to:
0. Relax some of the conditions on external inputs.
1. Update NNModule inputs/outputs when pruning the input/output.
2. Avoid copying external inputs/outputs, as nomnigraph already takes care of it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10598
Reviewed By: bwasti
Differential Revision: D9371730
Pulled By: yinghai
fbshipit-source-id: 9273be5041dc4cc8585587f47cb6721e518a06a8
Summary:
Custom Python installations that have no `python` or `python3` aliases can't be found by CMake's `findPythonInterp` without an extra CMake argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10543
Differential Revision: D9378844
Pulled By: ezyang
fbshipit-source-id: 022e20aab7e27a5a56b8eb91b6026151116193c7
Summary:
Fix "error LNK2019: unresolved external symbol" from `CAFFE_KNOWN_TYPE` in tests, where we should use dllexport instead of AT_CORE_API (= dllimport).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10602
Differential Revision: D9377394
Pulled By: Yangqing
fbshipit-source-id: 993062a461ffce393f2321c5391db5afb9b4e7ba
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10282
This diff removes the unused/deprecated features from the code base.
Reviewed By: manojkris
Differential Revision: D9169859
fbshipit-source-id: d6447b7916a7c687b44b20da868112e6720ba245
Summary:
This is the last step in the custom operator implementation: providing a way to build from C++ and Python. For this I:
1. Created a `FindTorch.cmake` taken largely from ebetica with a CMake function to easily create simple custom op libraries
2. Created a `torch/op.h` header for easy inclusion of necessary headers,
3. Created a test directory `pytorch/test/custom_operator` which includes the basic setup for a custom op.
1. It defines an op in `op.{h,cpp}`
2. Registers it with the JIT using `RegisterOperators`
3. Builds it into a shared library via a `CMakeLists.txt`
4. Binds it into Python using a `setup.py`. This step makes use of our C++ extension setup that we already have. No work, yey!
The pure C++ and the Python builds are separate and not coupled in any way.
zdevito soumith dzhulgakov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10226
Differential Revision: D9296839
Pulled By: goldsborough
fbshipit-source-id: 32f74cafb6e3d86cada8dfca8136d0dfb1f197a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10599
Not spawning threads with spin-lock synchronization is bad because they will switch to a `condvar` wait, which increases wake-up latency the next time they are needed.
Reviewed By: ajtulloch
Differential Revision: D9366664
fbshipit-source-id: 3b9e4a502aeefaf0ddc4795303a855d98980b02e
Summary:
This commit adds the ``buffers()`` and ``named_buffers()`` methods as
analogues of ``parameters()`` and ``named_parameters()``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10554
Reviewed By: SsnL
Differential Revision: D9367762
Pulled By: jma127
fbshipit-source-id: f2042e46a7e833dce40cb41681dbd80d7885c74e
Summary:
A continuation of https://github.com/pytorch/pytorch/pull/10504 for GPU, torch, etc. builds.
I was testing with
```
FULL_CAFFE2=1 python setup.py build_deps | tee ~/log.txt
cat ~/log.txt | egrep 'undefined refer' | sort | less
```
I'll rebase on master when Yangqing's changes in 10504 land, but putting up for some testing.
cc mingzhe09088 anderspapitto ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10507
Reviewed By: Yangqing
Differential Revision: D9359606
Pulled By: orionr
fbshipit-source-id: c2a3683b3ea5839689f5d2661da0bc9055a54cd2
Summary:
Resubmit #10416 with fixed tests. This removes the implicit GPU-to-CPU conversion when calling numpy, to keep the behavior matching other methods.
It requires users to move the tensor back to the CPU with `.cpu()` before calling numpy functions on it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10553
Differential Revision: D9350212
Pulled By: ailzhang
fbshipit-source-id: 9317d8fea925d4b20ae3150e2c1b39ba5c9c9d0a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10494
Adding the AllreduceBcube routines, as they are now available in gloo.
Reviewed By: wesolwsk
Differential Revision: D8269473
fbshipit-source-id: 6a3a32291bbf1fbb328b3ced0f2a753dc5caf4e5
Summary:
The ONNXIFI backend will absorb the constant weight in Conv, so we should not add it as an input; this is just a test artifact. Note that the Onnxifi transformer will do the right thing when cutting the graph to absorb the weights.
rdzhabarov
Pull Request resolved: https://github.com/pytorch/pytorch/pull/10575
Reviewed By: houseroad
Differential Revision: D9357339
Pulled By: yinghai
fbshipit-source-id: a613fa3acafa687295312f5211f8e9d7f77b39cd