Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56984
This is a preparation PR before we can create CUDAFuture in rref_impl.cpp.
The solution is to add a `FutureFactoryRegistry` in `rpc/utils.*`. The
TensorPipe RPC agent is responsible for registering the `CUDAFuture` and
`ivalue::Future` factories. The reason we need this change, instead of
directly using the `USE_CUDA` macro in the RRef files, is as follows. There
are three build targets: `torch_cpu`, `torch_cuda`, and `torch_python`.
`torch_python` is built on top of the other two. `torch_cpu` is CPU-only, so
it contains no CUDA-related code and hence no `USE_CUDA` macro. The
`tensorpipe_*` files are in `torch_python`, which does have access to CUDA.
However, the RRef source files are in `torch_cpu`, which cannot contain CUDA
code. The recommended solution is dynamic dispatch, hence this PR (see the
sketch below).
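A minimal sketch of what such a registry could look like, assuming a hypothetical `FutureFactoryRegistry` class (names and signatures here are illustrative, not necessarily the exact ones this PR adds):
```cpp
#include <functional>
#include <utility>

#include <ATen/core/ivalue.h>

// torch_cpu only sees this abstract factory type; a CUDA-aware library
// (e.g. the TensorPipe agent) swaps in a richer factory at load time.
class FutureFactoryRegistry {
 public:
  using factory_t =
      std::function<c10::intrusive_ptr<c10::ivalue::Future>(c10::TypePtr)>;

  static FutureFactoryRegistry& getInstance() {
    static FutureFactoryRegistry registry;
    return registry;
  }

  void registerFactory(factory_t factory) {
    factory_ = std::move(factory);
  }

  c10::intrusive_ptr<c10::ivalue::Future> createFuture(c10::TypePtr type) {
    return factory_(type);
  }

 private:
  // Default factory builds a plain ivalue::Future; torch_cuda/torch_python
  // can replace it with one that builds a CUDAFuture.
  factory_t factory_ = [](c10::TypePtr type) {
    return c10::make_intrusive<c10::ivalue::Future>(std::move(type));
  };
};
```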
Test Plan: Imported from OSS
Reviewed By: lw
Differential Revision: D28020917
Pulled By: mrshenli
fbshipit-source-id: e67c76a273074aebb61877185cc5e6bc0a1a5448
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56444
Added an out variant for `layer_norm`.
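For context, ATen out variants write into caller-provided tensors instead of allocating new ones. A hedged sketch of what such a signature could look like, mirroring the functional `native_layer_norm` (the exact signature added here may differ):
```cpp
#include <ATen/ATen.h>

// Hypothetical out-variant declaration: same inputs as native_layer_norm,
// but output, mean, and rstd are preallocated by the caller.
std::tuple<at::Tensor&, at::Tensor&, at::Tensor&> native_layer_norm_out(
    at::Tensor& out,
    at::Tensor& mean_out,
    at::Tensor& rstd_out,
    const at::Tensor& input,
    at::IntArrayRef normalized_shape,
    const c10::optional<at::Tensor>& weight,
    const c10::optional<at::Tensor>& bias,
    double eps);
```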
Test Plan:
buck test caffe2/aten:math_kernel_test -- NativeLayerNorm
buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest
Reviewed By: hlu1
Differential Revision: D27873846
fbshipit-source-id: 53ee9fec4ff9a4e78198b031e86b5afd013626dd
Summary:
Fixes https://github.com/pytorch/pytorch/issues/46702
- fails on probability distributions with an odd number of items
- tries to access an `acc_type` (`float`) in memory aligned for `scalar_t` (`float16`)
- produces unrepeatable results for large input tensors
- the parallel cumsum is not monotonic at some positions
### Fixes
- computing the cumsum on `acc_type` (`float`) instead of `scalar_t` (`float16`) fixed both issues
- the non-monotonic behavior may still happen even with `float`, though
- in these cases, deterministic behavior may be achieved by eliminating the race condition when writing the result, using the atomic function `atomicMax` (see the sketch below)
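A minimal host-side sketch of the accumulation-type part of the fix, assuming `scalar_t` is the half-precision input type and `acc_t` its float accumulation type (the real code is a CUDA kernel, where ties between threads are additionally resolved with `atomicMax` on the chosen bucket index):
```cpp
#include <cstddef>
#include <vector>

// Accumulate in acc_t (float) rather than scalar_t (float16); summing in
// float16 loses precision and produced the non-monotonic partial sums.
template <typename scalar_t, typename acc_t>
std::vector<acc_t> cumsum_acc(const std::vector<scalar_t>& probs) {
  std::vector<acc_t> out(probs.size());
  acc_t running = acc_t(0);
  for (std::size_t i = 0; i < probs.size(); ++i) {
    running += static_cast<acc_t>(probs[i]);
    out[i] = running;  // monotonic up to float rounding
  }
  return out;
}
```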
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55364
Reviewed By: mruberry
Differential Revision: D28031666
Pulled By: ngimel
fbshipit-source-id: 0fc6289e0b9ea2d31ef3771e7ca370de8f5c02de
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56704
This is a re-submission of PR https://github.com/pytorch/pytorch/pull/54175.
Main changes compared to the original PR:
- Switch to including `<ATen/cuda/cub.cuh>`.
- Use `CUB_WRAPPER` to reduce boilerplate code (see the sketch below).
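For context, CUB device-wide algorithms follow a two-call protocol: the first call with a null buffer only reports the scratch size, and the second call does the work. That is the boilerplate a wrapper like `CUB_WRAPPER` typically hides. A hedged sketch of the raw pattern, using `cub::DeviceScan::InclusiveSum` as an example (not necessarily the op this PR touches):
```cpp
#include <cub/cub.cuh>

#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDACachingAllocator.h>

// d_in/d_out are device pointers and n the element count (illustrative).
void inclusive_sum(const float* d_in, float* d_out, int n) {
  size_t temp_bytes = 0;
  cudaStream_t stream = at::cuda::getCurrentCUDAStream();
  // First call with a null buffer only computes temp_bytes.
  cub::DeviceScan::InclusiveSum(nullptr, temp_bytes, d_in, d_out, n, stream);
  auto temp = c10::cuda::CUDACachingAllocator::get()->allocate(temp_bytes);
  // Second call performs the actual scan using the scratch space.
  cub::DeviceScan::InclusiveSum(temp.get(), temp_bytes, d_in, d_out, n, stream);
}
```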
Test Plan:
Will check CI status to make sure everything passes.
Added a unit test.
Reviewed By: ngimel
Differential Revision: D27941257
fbshipit-source-id: 24a0e0c7f6c46126d2606fe42ed03dca15684415
Summary:
This PR tries to make the docs of `torch.linalg`:
- More uniform in notation and structure across every function.
- More uniform in the use of backquotes and the `:attr:` directive.
- More readable for a non-specialised audience, by explaining the form the factorisations take and when it is beneficial to use which arguments in some solvers.
- More connected among the different functions, through the `.. seealso::` directive.
- More informative about when gradients explode, when a function silently returns a wrong result, and when things do not work in general.
I tried to follow the structure of "one short description and then the rest" to be able to format the docs like those of `torch.` or `torch.nn`. I did not do that yet, as I am waiting for the green light on this idea:
https://github.com/pytorch/pytorch/issues/54878#issuecomment-816636171
What this PR does not do:
- Clean the documentation of other functions that are not in the `linalg` module (although I started doing this for `torch.svd`, but then I realised that this PR would touch way too many functions).
Fixes https://github.com/pytorch/pytorch/issues/54878
cc mruberry IvanYashchuk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56265
Reviewed By: H-Huang
Differential Revision: D27993986
Pulled By: mruberry
fbshipit-source-id: adde7b7383387e1213cc0a6644331f0632b7392d
Summary:
According to `vecLib.framework/Headers/clapack.h`, Accelerate.framework's LAPACK implementation is based on 3.2.1, and so LRWORK should be computed using the following formula (from the LAPACK 3.2.1 documentation):
```
*> If JOBZ = 'N', LRWORK >= 7*min(M,N).
*> Otherwise,
*> LRWORK >= min(M,N)*max(5*min(M,N)+7,2*max(M,N)+2*min(M,N)+1)
```
Found while looking at `test_linalg.py` crashes on M1, but it would have happened on x86_64 as well if PyTorch with the Accelerate framework were tested there.
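For concreteness, the rule above translates to something like this (a minimal sketch; the helper name is hypothetical, and `jobz`, `m`, `n` follow the LAPACK convention):
```cpp
#include <algorithm>
#include <cstdint>

// Real workspace size per the LAPACK 3.2.1 rule quoted above.
int64_t compute_lrwork(char jobz, int64_t m, int64_t n) {
  int64_t mn = std::min(m, n);
  int64_t mx = std::max(m, n);
  return jobz == 'N'
      ? 7 * mn
      : mn * std::max(5 * mn + 7, 2 * mx + 2 * mn + 1);
}
```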
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56847
Reviewed By: albanD
Differential Revision: D27983352
Pulled By: malfet
fbshipit-source-id: f757c515c85b32c1e09d00a91bc20fe4b390a75a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56824
This PR adds 6 dispatch keys to be used for prototyping.
I'm not sure what the best way to name these is; please let me know if
you think they should have the same prefix.
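As a reminder of how such keys are exercised, here is a hedged sketch of routing an op through a dispatch key; `PrivateUse1` is an existing placeholder key and stands in for whatever these new keys end up being named:
```cpp
#include <ATen/ATen.h>
#include <torch/library.h>

// Hypothetical backend kernel; a real one would compute on the custom
// backend rather than calling back into aten (which would re-dispatch).
at::Tensor my_add(
    const at::Tensor& self,
    const at::Tensor& other,
    const c10::Scalar& alpha) {
  TORCH_CHECK(false, "my_add: placeholder kernel for this sketch");
  return self;  // unreachable, keeps the signature well-formed
}

// Route aten::add.Tensor to my_add for tensors carrying PrivateUse1.
TORCH_LIBRARY_IMPL(aten, PrivateUse1, m) {
  m.impl("add.Tensor", my_add);
}
```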
Test Plan: - wait for tests
Reviewed By: driazati
Differential Revision: D27999963
Pulled By: zou3519
fbshipit-source-id: 0c3ef4788854f7a93d077cc454b773a6eedbbc22
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56945
In preparation to turn these on for CI
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: walterddr
Differential Revision: D28018454
Pulled By: seemethere
fbshipit-source-id: fa94d666499877f2cdd7b8fd3fc8b2d8127f61e8
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56941
Sets the custom test binaries we build in .jenkins/pytorch/build.sh to
be built in the `build` directory instead of the directory above the
workspace.
This should alleviate any weirdness we were seeing before with test
binaries having to be overwritten
Signed-off-by: Eli Uriegas <eliuriegas@fb.com>
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28018453
Pulled By: seemethere
fbshipit-source-id: 74add11037a622e011d00fb6292bfe20e1d55d9e
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56797
After adding a default seeding strategy for the NumPy random module within each DataLoader worker (#56488), two concerns were raised:
- We dropped support for NumPy < 1.17 due to `SeedSequence`.
- To support seeding for NumPy < 1.17, how can we provide a seed for `numpy.random`?
  - The first option is to set the same seed as `random`. The problem is that `numpy.random` and `random` share the same algorithm, so with the same seed they produce exactly the same state sequence. Thanks to rkern, we noticed these so-called [bad things](https://github.com/PyTorchLightning/pytorch-lightning/pull/6960#issuecomment-818393659).
  - Since most users are not aware of this problem, we can provide a better default seed for `numpy.random` using the same `SeedSequence` algorithm as NumPy. This is just a workaround: a hard-coded function generates an array of four int32 values as the seed.
To better cope with this problem, since many third-party libraries besides NumPy have their own random modules, we may eventually need to implement a `SeedSequence` within the `torch.random` module, so that users can `spawn` a new `SeedSequence` for each library.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28000619
Pulled By: ejguan
fbshipit-source-id: 5701c8124a38ea5ded69eb8eee70f9680877ffa6
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55699
Todo:
- error message should be updated to say whether the failure is for fn's real or imaginary component
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28007887
Pulled By: soulitzer
fbshipit-source-id: 1819201f59c8586a1d9631db05983969438bde66
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/55692
### Release notes
`get_numerical_jacobian` and `get_analytical_jacobian` only support `grad_out=1`, and `fn` no longer accepts functions that return complex output.
Test Plan: Imported from OSS
Reviewed By: H-Huang
Differential Revision: D28004614
Pulled By: soulitzer
fbshipit-source-id: 9592c9c69584b4035b39be62252f138dce39d3b5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56976
Band-aid fix for #54282
Signed-off-by: Edward Z. Yang <ezyang@fb.com>
Test Plan: Imported from OSS
Reviewed By: mruberry
Differential Revision: D28020401
Pulled By: ezyang
fbshipit-source-id: 50546d5275eade408d65e9c883999fb3b65ff55a
Summary:
Fixes https://github.com/pytorch/pytorch/issues/56243 by adding a note to mutating functions that do not follow the trailing `_` convention in `torch/nn/modules/module.py`.
I can also raise separate PRs for other files, if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56877
Reviewed By: ezyang
Differential Revision: D28008856
Pulled By: jbschlosser
fbshipit-source-id: 63bfca0df05e49fceadd3167b1427dcb5542206a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56991
Original commit changeset: c5aa5f61a215
Diff: D27987746 (267b554b6f)
Test Plan: `buck test` under the glow-buck target, which is the target this reversion is intended to fix.
Reviewed By: jfix71
Differential Revision: D28019659
fbshipit-source-id: 37584ff404fc9195b309a5a6afdb4edbc2b4f088
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56816
This doesn't actually work. For some reason the linker can't find
`at::cpu::logit_out`, and it's not worth digging into why.
Test Plan: Imported from OSS
Reviewed By: ZolotukhinM
Differential Revision: D27977406
Pulled By: bertmaher
fbshipit-source-id: d0235a393f25243e2c8a011e9baf267daf483ae4
Summary:
Adds CUDA synchronization when entering and exiting the profiler
context manager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56651
Test Plan: CI
Reviewed By: gdankel
Differential Revision: D27926270
Pulled By: ilia-cher
fbshipit-source-id: 5cf30128590c1c71a865f877578975c4a6e2cb48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56717
The signal_handler was under the caffe2 namespace but was being used
by PyTorch as well.
I've fixed this by moving it to the c10 namespace, where both C2 and PyTorch
can now use it.
The signal_handler interface in caffe2/utils/signal_handler.h is kept the same
for backward compatibility for C2, but most of the common code is moved to c10 (see the sketch below).
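A common way to keep the old interface intact while relocating the implementation looks roughly like this (a hedged sketch; the actual header paths and class names may differ):
```cpp
// c10/util/signal_handler.h -- new home for the shared implementation.
namespace c10 {
class SignalHandler {
  // ... common signal handling code shared by C2 and PyTorch ...
};
} // namespace c10

// caffe2/utils/signal_handler.h -- kept so existing C2 call sites compile.
#include <c10/util/signal_handler.h>
namespace caffe2 {
// The old name resolves to the moved implementation.
using SignalHandler = c10::SignalHandler;
} // namespace caffe2
```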
ghstack-source-id: 127446929
Test Plan: waitforbuildbot
Reviewed By: ezyang
Differential Revision: D27946738
fbshipit-source-id: d6228d1a0108f4c807d405e7a0bb799c5375388f
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56908
CUDA channels might implement CPU-to-CPU transfers, but will usually be
less efficient for that purpose.
Test Plan: CI
Reviewed By: lw
Differential Revision: D27994069
fbshipit-source-id: fefa7f243eb43cf769864233df518f2a1819f949
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56812
`fb::equally_split` gets fused with `ListUnpack`, and all outputs from `ListUnpack` get attached to `fb::equally_split`.
So `fb::equally_split` will have as many outputs as `ListUnpack`.
Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators
buck test caffe2/torch/fb/sparsenn:test -- test_equally_split_op
Reviewed By: hlu1
Differential Revision: D27974999
fbshipit-source-id: b2ca19ff86aec76b977c1e3cfc56567adab66b35
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/56943
If the module is placed on a CUDA device, then all the CPU tensors in `args` and `kwargs` will also be implicitly moved to the same CUDA device to run forward.
Currently we still need to move the forward output from the CUDA device back to CPU, until:
1) the process group RPC backend is completely deprecated and we always use the TensorPipe RPC backend; or
2) a device map is explicitly provided to the TensorPipe RPC backend.
These steps will be done in a separate PR.
Original PR issue: https://github.com/pytorch/pytorch/issues/51670
ghstack-source-id: 127457584
Test Plan:
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- test_input_moved_to_cuda_device_script
buck test mode/dev-nosan caffe2/test/distributed/rpc:process_group_agent -- RemoteModule
buck test mode/dev-nosan //caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test -- --exact 'caffe2/torch/fb/training_toolkit/applications/sparse_nn/batch_distributed_inference/tests:batch_distributed_inference_test - test_load_di_parts (caffe2.torch.fb.training_toolkit.applications.sparse_nn.batch_distributed_inference.tests.batch_distributed_inference_test.BatchDistributedInferenceTest)'
Reviewed By: wanchaol
Differential Revision: D27934791
fbshipit-source-id: de27e27b905db83cc52800e63684fc6c942e9dc7
Summary:
Fixes https://github.com/pytorch/pytorch/issues/48141
~Mypy is complaining about a missing arg in a function call.~
```bash
torch/backends/_nnapi/serializer.py:806: error: Too few arguments for "_do_add_binary" [call-arg]
Found 1 error in 1 file (checked 1140 source files)
```
9392137dbe/torch/backends/_nnapi/serializer.py (L804-L806)
~dreiss, would you mind taking a look when you have some cycles to spare, and seeing what the appropriate value for `fuse_code` would be here? Thanks :)~
Edit: https://github.com/pytorch/pytorch/issues/48925 got merged a couple of days ago. The blocking part is now unblocked, and I just pushed the changes to make mypy happy again. This PR is ready for review.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48142
Reviewed By: ezyang
Differential Revision: D28006249
Pulled By: walterddr
fbshipit-source-id: 5e43eeba7143512a549efaad31541f86718add7c