Commit graph

1260 commits

Author SHA1 Message Date
Sameer Deshmukh
d100d98db8 torch.linalg routines return torch.linalg.LinAlgError when a numerical error in the computation is found. (#68571)
Summary:
This PR fixes https://github.com/pytorch/pytorch/issues/64785 by introducing `torch.linalg.LinAlgError`, which reports errors caused by bad values in linear algebra routines and lets users catch numerical failures specifically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68571

Reviewed By: malfet

Differential Revision: D33254087

Pulled By: albanD

fbshipit-source-id: 94b59000fdb6a9765e397158e526d1f815f18f0f
2021-12-23 10:53:26 -08:00
Michael Suo
23ab6ce723 Revert D33141011: extract //c10/macros into its own package
Test Plan: revert-hammer

Differential Revision:
D33141011 (8f4c724bb6)

Original commit changeset: caa97448f922

Original Phabricator Diff: D33141011 (8f4c724bb6)

fbshipit-source-id: 79423ed51f9a43ecf1f716a739c74949b66fadb4
2021-12-22 17:48:45 -08:00
Michael Suo
f126501d37 Revert D33141010: allow Bazel to build without glog and gflags
Test Plan: revert-hammer

Differential Revision:
D33141010 (8c41f258f4)

Original commit changeset: d951e5616459

Original Phabricator Diff: D33141010 (8c41f258f4)

fbshipit-source-id: d52ca20ddf4c5a91cb09a32fecb30a00227fc4ae
2021-12-22 17:47:23 -08:00
Michael Dagitses
8c41f258f4 allow Bazel to build without glog and gflags (#69995)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69995
ghstack-source-id: 146027060

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33141010

fbshipit-source-id: d951e5616459e8aa163ae0741e245f53185580e8
2021-12-22 14:30:30 -08:00
Michael Dagitses
8f4c724bb6 extract //c10/macros into its own package (#69994)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69994
ghstack-source-id: 145799968

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33141011

fbshipit-source-id: caa97448f922d7c12980bf01669c1b3ef5c1213b
2021-12-22 14:30:27 -08:00
Michael Dagitses
02c63c3006 extract out c10 targets to the c10 package (#69992)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69992

Test Plan: Imported from OSS

Reviewed By: mruberry

Differential Revision: D33141013

fbshipit-source-id: e5edd6bd5b5834ac27390ba940ebed9148512c8d
2021-12-16 13:11:49 -08:00
CodemodService FBSourceClangFormatLinterBot
f7210f8d90 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D33090919

fbshipit-source-id: 78efa486776014a27f280a01a21f9e0af6742e3e
2021-12-14 08:06:58 -08:00
Peter Bell
b08d64202a Remove THGeneral (#69041)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69041

`TH_CONCAT_{N}` is still being used by THP so I've moved that into
its own header but all the compiled code is gone.

Test Plan: Imported from OSS

Reviewed By: anjali411

Differential Revision: D32872477

Pulled By: ngimel

fbshipit-source-id: 06c82d8f96dbcee0715be407c61dfc7d7e8be47a
2021-12-13 16:14:28 -08:00
Nikita Shulga
59deee8308 Make c10 tests compilable with -Werror (#69711)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69711

Test Plan: Imported from OSS

Reviewed By: r-barnes

Differential Revision: D32997005

Pulled By: malfet

fbshipit-source-id: 369194051ece9d213b48584ca84e5d76b3794dae
2021-12-10 16:47:46 -08:00
Scott Wolchok
d026057bb3 [PyTorch] Update SmallVector from LLVM (#69110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69110

I pasted the current LLVM code, reapplied the modifications listed in the code comments, caught a few more in the diff/build process. The trivially copyable detection is different now; if gcc builds fail, will try reverting to C10_IS_TRIVIALLY_COPYABLE or copying what LLVM is doing.

The motivation for this change is that, as noted in an existing comment, C10_IS_TRIVIALLY_COPYABLE did the wrong thing for std::unique_ptr, which caused problems with D32454856 / #68412.
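
As a rough illustration of why trivially-copyable detection matters for a small-vector type, the sketch below uses only the standard `std::is_trivially_copyable` trait (not the `C10_IS_TRIVIALLY_COPYABLE` macro, whose definition is not shown here). The helper name `can_memcpy` is hypothetical.

```cpp
#include <cassert>
#include <memory>
#include <type_traits>

// Hypothetical helper: a container may relocate elements with memcpy only
// when the element type is trivially copyable.
template <typename T>
constexpr bool can_memcpy() {
  return std::is_trivially_copyable<T>::value;
}

// int has no special member functions, so a bytewise copy is fine.
static_assert(can_memcpy<int>(), "int can be relocated with memcpy");

// unique_ptr has a user-provided move constructor and destructor, so a
// bytewise copy would duplicate ownership and lead to a double free.
static_assert(!can_memcpy<std::unique_ptr<int>>(),
              "unique_ptr must be moved, not memcpy'd");
```

A trait that wrongly answers "yes" for `std::unique_ptr` would send the container down the `memcpy` path, which is the kind of problem the commit message describes.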

ghstack-source-id: 145327773

Test Plan: CI

Reviewed By: bhosmer, mruberry

Differential Revision: D32733017

fbshipit-source-id: 9452ab90328e3fdf457aad23a26f2f6835b0bd3d
2021-12-10 11:57:19 -08:00
Richard Barnes
29d759948e use irange for loops 2 (#66746)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66746

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
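
The loop rewrite can be sketched as follows for the zero-start case. The `IntRange` class and the free function `irange` below are a hypothetical stand-in written for this example, not the real `c10::irange` implementation; the two `sum_*` functions are likewise illustrative.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for c10::irange: iterates over [0, end).
class IntRange {
 public:
  class Iterator {
   public:
    explicit Iterator(int64_t value) : value_(value) {}
    int64_t operator*() const { return value_; }
    Iterator& operator++() {
      ++value_;
      return *this;
    }
    bool operator!=(const Iterator& other) const {
      return value_ != other.value_;
    }

   private:
    int64_t value_;
  };

  // A negative bound is clamped to zero so the loop body never runs.
  explicit IntRange(int64_t end) : end_(end < 0 ? 0 : end) {}
  Iterator begin() const { return Iterator(0); }
  Iterator end() const { return Iterator(end_); }

 private:
  int64_t end_;
};

inline IntRange irange(int64_t end) { return IntRange(end); }

// Old style: the index type, bound, and increment are all spelled by hand.
inline int64_t sum_classic(int64_t x_max) {
  int64_t total = 0;
  for (int64_t var = 0; var < x_max; var++) total += var;
  return total;
}

// New style: the loop variable is const and takes its type from the range.
inline int64_t sum_irange(int64_t x_max) {
  int64_t total = 0;
  for (const auto var : irange(x_max)) total += var;
  return total;
}
```

Because the loop variable is `const`, accidental modification inside the body becomes a compile error, and the index type can no longer disagree with the bound.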

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31705361

fbshipit-source-id: 33fd22eb03086d114e2c98e56703e8ec84460268
2021-12-10 04:26:23 -08:00
CodemodService FBSourceClangFormatLinterBot
015e481a41 [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D32975574

fbshipit-source-id: 66856595c7bc29921f24a2c5c00c72892f262aa1
2021-12-09 00:10:33 -08:00
anjali411
3e6164449f Add efficient zero tensors (#64837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837

Test Plan: Imported from OSS

Reviewed By: gchanan

Differential Revision: D32834987

Pulled By: anjali411

fbshipit-source-id: 20ea08ade0db0044ca633d9c1a117a6a2e65d1fd
2021-12-08 10:37:39 -08:00
Nelson Elhage
a813ddf5ec CUDACachingAllocator: make an error message more accurate. (#69174)
Summary:
The `TORCH_CHECK` asserts for strictly-greater-than `kLargeBuffer`,
but the exception claims `>=`. Fix the error message to match the
code.

Happy to open an issue if it's helpful; I was hopeful the trivial fix doesn't need a separate issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69174

Reviewed By: zou3519

Differential Revision: D32760055

Pulled By: H-Huang

fbshipit-source-id: 1a8ab68f36b326ed62d78afdcb198f4d6572d017
2021-12-03 15:04:59 -08:00
Mark Richardson
834bd3134e Back out "Add efficient zero tensors" (#69327)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69327

Original commit changeset: d44096d88265

Original Phabricator Diff: D32144240 (668574af4a)

Test Plan:
CI

original diff failed 175 builds in CI

Reviewed By: airboyang, anjali411

Differential Revision: D32809407

fbshipit-source-id: c7c8e69bcee0274992e2d5da901f035332e60071
2021-12-02 19:11:41 -08:00
Michael Carilli
572c3e3118 Fix some usages of CUDA_VERSION (#69092)
Summary:
See https://pytorch.slack.com/archives/G4Z791LL8/p1638229956006300

I grepped c10, aten, and torch for CUDA_VERSION and checked the usages I saw.
I can't guarantee I made a clean sweep, but this improves the status quo.

cc ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69092

Reviewed By: zou3519

Differential Revision: D32786919

Pulled By: ngimel

fbshipit-source-id: 1d29827dca246f33118d81e136252ddb5bf3830f
2021-12-02 18:32:47 -08:00
Peter Bell
33c3c539b6 THPStorage: Prefer intrusive_ptr over owning raw pointers (#69248)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/69248

Reviewed By: zou3519

Differential Revision: D32771035

Pulled By: ngimel

fbshipit-source-id: cf9bbcc5563ae9715ecf13631ba56c32240e59e3
2021-12-02 16:33:03 -08:00
anjali411
668574af4a Add efficient zero tensors (#64837)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/64837

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D32144240

Pulled By: anjali411

fbshipit-source-id: d44096d882657c7f9270a16636900e0b73cefa40
2021-12-02 08:47:45 -08:00
Hao Lu
ed3b73fd4d [Static Runtime] Skip ProcessedNode::verify_no_memory_overlap() for out variants (#68639)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68639

Fix all problems related to `ProcessedNode::verify_no_memory_overlap()`
- Only enable this check for native and fallback ops that are not inplace or view ops
- Enable `ProcessedNode::verify_no_memory_overlap()` in debug mode and enforce it
- Add gflag --static_runtime_disable_debug_memory_overlap_check to test the runtime memory overlap fix for bad schemas

fb::expand_dims's schema was not correct after this check is re-enabled. It's fixed in D32556204 (39ab417107)

Reviewed By: mikeiovine

Differential Revision: D32553708

fbshipit-source-id: 88de63cdf1ee4f87b7726c8b65a11a5fb8a99d13
2021-12-02 05:03:12 -08:00
Peter Bell
f7d598948a Remove native_functions.yaml dependency from TensorModeKernel.cu (#66913)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/66913

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D31856102

Pulled By: dagitses

fbshipit-source-id: 8888a1984adef09104a40ae683d091143cd1f4fa
2021-11-30 04:22:09 -08:00
Kurt Mohler
d9e7d85390 Remove TH/THC Storage (#68556)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/67852

cc ezyang bhosmer smessmer ljk53 bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/68556

Reviewed By: ejguan

Differential Revision: D32652758

Pulled By: ngimel

fbshipit-source-id: 170956fca112606f9008abe09b92c6ddc411be09
2021-11-29 12:55:20 -08:00
Dhruv Matani
cb2a41e508 [PyTorch Edge] Don't use LeftRight in mobile (#66064)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66064

The only place this is used seems to be in the dispatcher for `operatorLookupTable_`. Disarming `LeftRight` disarms it for this one use case.

This should make .so loading faster, and also reduce memory consumption since `LeftRight<T>` does 2 writes for every write. I'd like to get a thorough review from reviewers for this diff since I want to make sure that initialization of stuff that writes into the dispatcher isn't going to happen on multiple threads for on-device use.

Created a new class named `LeftRightNoOpWrapper<T>` for use in mobile builds.

### Why is LeftRight<T> slow?

It maintains 2 copies of each data structure `T` to be able to keep reads quick. Every write goes to both data structures, which means that writes cost 2x and memory overhead is also 2x.

### Why is this safe for mobile builds?

1. .so loading never happens concurrently with model execution
2. Custom ops are loaded during .so load - initializers are all run serially
3. I don't see any threads being spawned from the global schema and kernel initializers

After discussing with dreiss, it seems like there could be rare cases in OSS apps or internal Android/iOS apps where a `.so` or `dylib` is loaded after the PT runtime is loaded, and this load happens concurrently with an in-progress inference run, which is looking up the operator table in the dispatcher.

To avoid crashes there, it seems reasonable to use the RW lock, since I don't expect any contention 99.9% of the time.

When registering operators, everything is serial so only one thread will ever hold the lock. The next time it needs the lock, it will have already released it.
During inference runs, only one thread will ask for the shared lock unless multiple concurrent inferences are in progress. Even in that case, they will all be able to simultaneously get the Read lock.
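
The locking scheme described above can be sketched as a single-copy wrapper guarded by a reader/writer lock. This is a minimal sketch assuming C++17 (`std::shared_mutex`); the class name `RWLockedWrapper`, its `read`/`write` methods, and `demo_lookup` are hypothetical and are not taken from the actual `LeftRightNoOpWrapper<T>` code.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <utility>

// Hypothetical single-copy wrapper guarded by a reader/writer lock.
template <typename T>
class RWLockedWrapper {
 public:
  // Readers take the shared lock, so concurrent reads do not block each other.
  template <typename F>
  auto read(F&& reader) const {
    std::shared_lock<std::shared_mutex> lock(mutex_);
    return std::forward<F>(reader)(data_);
  }

  // Writers take the exclusive lock and update the only copy exactly once.
  template <typename F>
  auto write(F&& writer) {
    std::unique_lock<std::shared_mutex> lock(mutex_);
    return std::forward<F>(writer)(data_);
  }

 private:
  T data_{};
  mutable std::shared_mutex mutex_;
};

// Usage sketch: register one entry serially, then look it up.
inline int demo_lookup() {
  RWLockedWrapper<std::map<std::string, int>> table;
  table.write([](std::map<std::string, int>& t) { t["aten::add"] = 1; });
  return table.read(
      [](const std::map<std::string, int>& t) { return t.at("aten::add"); });
}
```

Compared with keeping two copies of `T`, this stores the data once and writes it once, at the cost of readers touching a lock that is expected to be uncontended.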

Test Plan: Build and generate a local build of the iOS app to test.

Reviewed By: swolchok

Differential Revision: D31352346

fbshipit-source-id: c3f12454de3dbd7b421a6057d561e9373ef5bf98
2021-11-09 21:49:45 -08:00
Scott Wolchok
b0c05297f9 [Static Runtime] Arena allocate StorageImpls for managed tensors (#66130)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66130

We're reusing backing storage for these tensors, which is only safe because they have non-overlapping lifetimes. Accordingly, it seems that they can also share their StorageImpl.
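
The general idea of arena allocation (one contiguous buffer for many small objects rather than one heap allocation each) can be sketched as below. This is an illustration under stated assumptions, not the Static Runtime implementation: `FakeStorage`, `StorageArena`, and `demo_contiguous` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for a small per-tensor bookkeeping object.
struct FakeStorage {
  explicit FakeStorage(std::size_t nbytes) : nbytes(nbytes) {}
  std::size_t nbytes;
};

// One contiguous allocation holds every FakeStorage.
class StorageArena {
 public:
  explicit StorageArena(std::size_t capacity)
      : capacity_(capacity),
        buffer_(static_cast<FakeStorage*>(
            ::operator new(capacity * sizeof(FakeStorage)))) {}

  ~StorageArena() {
    // Destroy only the objects that were actually constructed.
    for (std::size_t i = 0; i < size_; ++i) buffer_[i].~FakeStorage();
    ::operator delete(buffer_);
  }

  StorageArena(const StorageArena&) = delete;
  StorageArena& operator=(const StorageArena&) = delete;

  // Construct the next object in place inside the shared buffer.
  FakeStorage* emplace(std::size_t nbytes) {
    assert(size_ < capacity_);
    return new (buffer_ + size_++) FakeStorage(nbytes);
  }

  std::size_t size() const { return size_; }

 private:
  std::size_t capacity_;
  std::size_t size_ = 0;
  FakeStorage* buffer_;
};

// Usage sketch: two objects end up adjacent in memory.
inline bool demo_contiguous() {
  StorageArena arena(2);
  FakeStorage* a = arena.emplace(64);
  FakeStorage* b = arena.emplace(128);
  return b == a + 1 && a->nbytes == 64 && b->nbytes == 128 &&
      arena.size() == 2;
}
```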

ghstack-source-id: 142427752

Test Plan:
benchmarked ctr_mobile_feed local and local_ro:

Using recordio inputs for model 302008423_0

```
swolchok@devbig032 ~/f/fbcode> env MKL_NUM_THREADS=1 OMP_NUM_THREADS=1  > environment^C
swolchok@devbig032 ~/f/fbcode> sudo ~/fbsource2/fbcode/scripts/bertrand/noise/denoise-env.sh \
                                 /tmp/ptvsc2_predictor_benchNov1ArenaAllocateStorageImpls \
                               --scripted_model=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.predictor.disagg.local \
                               --method_name=local.forward --pt_cleanup_activations=1 \
                               --pt_enable_out_variant=1 --pt_optimize_memory=1 --iters=2 --warmup_iters=2 \
                                      --num_threads=1 --pt_enable_static_runtime=1 --set_compatibility=1 --repetitions=5 --recordio_use_ivalue_format=1 --recordio_inputs=/data/users/swolchok/ctr_mobile_feed_q3_2021/302008423_0.local.inputs.recordio

Stable
========================================
I1101 14:19:16.473964 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0131. Iters per second: 49.9673
I1101 14:20:12.193130 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 20.0155. Iters per second: 49.9612
I1101 14:21:07.761898 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9751. Iters per second: 50.0624
I1101 14:22:03.218066 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9104. Iters per second: 50.2249
I1101 14:22:58.723256 2748837 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.956. Iters per second: 50.1102
I1101 14:22:58.723306 2748837 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.974, standard deviation: 0.043643

ArenaAllocateStorageImpls
========================================
I1101 14:08:57.070914 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9771. Iters per second: 50.0572
I1101 14:09:52.605121 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.924. Iters per second: 50.1907
I1101 14:10:48.098287 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9353. Iters per second: 50.1624
I1101 14:11:43.645395 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9723. Iters per second: 50.0694
I1101 14:12:39.171636 2695478 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 19.9673. Iters per second: 50.0819
I1101 14:12:39.171685 2695478 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 19.9552, standard deviation: 0.0239318

difference: 0.0188 (0.09%), which is less than 1 standard deviation

Stable, local_ro
========================================
I1101 14:26:10.796161 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25991. Iters per second: 793.708
I1101 14:26:12.194727 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26862. Iters per second: 788.26
I1101 14:26:13.591312 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.26549. Iters per second: 790.207
I1101 14:26:14.982439 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25943. Iters per second: 794.01
I1101 14:26:16.377033 2787930 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.25995. Iters per second: 793.68
I1101 14:26:16.377094 2787930 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.26268, standard deviation: 0.00414788

ArenaAllocateStorageImpls, local_ro
========================================
I1101 14:26:45.875073 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20987. Iters per second: 826.536
I1101 14:26:47.207271 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20827. Iters per second: 827.633
I1101 14:26:48.533766 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.20023. Iters per second: 833.174
I1101 14:26:49.850610 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19206. Iters per second: 838.884
I1101 14:26:51.172356 2790009 PyTorchPredictorBenchLib.cpp:251] PyTorch run finished. Milliseconds per iter: 1.19958. Iters per second: 833.622
I1101 14:26:51.172411 2790009 PyTorchPredictorBenchLib.cpp:262] Mean milliseconds per iter: 1.202, standard deviation: 0.00722754

Difference: 0.06 msec/iter (4.8%), which is much more than 1 standard deviation

```

We can see that this is a large relative improvement on local_ro, but it has no effect on local.
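
The local_ro comparison can be re-derived from the per-run numbers in the log. The sketch below copies the five "Milliseconds per iter" values from each local_ro run; the function names are hypothetical helpers written for this check.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

inline double mean(const std::vector<double>& xs) {
  double total = 0.0;
  for (double x : xs) total += x;
  return total / static_cast<double>(xs.size());
}

// Relative improvement of `candidate` over `baseline`, as a fraction.
inline double relative_improvement(double baseline, double candidate) {
  return (baseline - candidate) / baseline;
}

// Per-run "Milliseconds per iter" values copied from the local_ro logs.
inline double local_ro_improvement() {
  const std::vector<double> stable = {
      1.25991, 1.26862, 1.26549, 1.25943, 1.25995};
  const std::vector<double> arena = {
      1.20987, 1.20827, 1.20023, 1.19206, 1.19958};
  return relative_improvement(mean(stable), mean(arena));
}
```

The means come out to roughly 1.2627 ms and 1.2020 ms, a gap of about 0.06 ms per iteration, or about 4.8% of the baseline.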

Reviewed By: hlu1

Differential Revision: D31357486

fbshipit-source-id: 229c003677da76e89c659d0e0639002accced76e
2021-11-04 15:43:39 -07:00
Richard Barnes
a122ba776a Fix less_than_lowest warnings (#67422)
Summary:
Fixes useless comparison against zero warnings for Half.h

Pull Request resolved: https://github.com/pytorch/pytorch/pull/67422

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31951939

fbshipit-source-id: 3e9940adda2d57b4d9b122f3862706c673f9ef4b
2021-11-01 11:19:55 -07:00
Brian Hirsh
0032fa7725 Add a Functionalization pass in core (#64432)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/64432

Original PR description + feedback here: https://github.com/pytorch/pytorch/pull/63048

I've addressed all of the feedback in the original PR and made some pretty large changes, listed below.

**Table of Contents**
- Starting points
- List of the main changes from the original PR
- Next Steps
- Example codegen output (for a view, mutation, and view+mutation op)

**Starting Points**

A good place to start when looking through the PR:
* Alban mentioned that this is a useful mental model (thanks Ed for originally making this clear to me). Semantically, the pass currently does THREE things, which are all needed by functorch - all fused together into one big pass.
  * (a) alias removal, which replaces {view} calls with {view}_copy calls, and manually tracks aliasing information, so that when one tensor is mutated, we re-apply the same mutation to all of the aliases. This is the bulk of the work - once this is done, the next 2 things are trivial to implement.
  * (b) mutation removal, which is easy to do once we know that there are no aliases. Every mutation `a.add_(b)` becomes `a.replace_(a.add(b))`
  * (c) reapplying views: all of the `{view}_copy` calls are replaced with `{view}` calls again. This is an optimization that we can make specifically for functorch (and strided backends), that only care about mutation removal and not alias removal
  * XLA and Vulkan only want (a), or (a) + (b). Later, we'll want to split this out so that you can actually opt into different versions of this logic.
  * There is currently no {view}_copy replacement, because the <replace views with copies> and <replace copies with views> steps have been combined into a single pass. Later, we'll want to actually implement {view}_copy variants of each view operator, probably with codegen.
* documentation breadcrumb 1, in `FunctionalTensorWrapper.cpp`: https://github.com/pytorch/pytorch/pull/64432/files#diff-a0bac99bf205dba5b94cb64fc2466d3d55d991887572f9cd6a02e27b3a91dd60R59 (you might have to expand the `FunctionalTensorWrapper.cpp` file, which GitHub closes by default because it's large)
* documentation breadcrumb 2, in `FunctionalTensorWrapper.h`: https://github.com/pytorch/pytorch/pull/64432/files#diff-c945c71a4ccac65871f24a912e8904f9a5088b24a32e636727ea9c8fe920708aR12
* Reading through the codegen output at the bottom of this description.
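
Step (b) above, rewriting `a.add_(b)` as `a.replace_(a.add(b))`, can be sketched on plain vectors. This is a toy model only: `Values`, `FunctionalWrapper`, `snapshot`, and `demo_mutation_removal` are hypothetical names and do not mirror the real `FunctionalTensorWrapper` API.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

using Values = std::vector<double>;

// Pure, out-of-place add: never touches its inputs.
inline Values add(const Values& a, const Values& b) {
  assert(a.size() == b.size());
  Values out(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
  return out;
}

// Hypothetical wrapper: holds an immutable value and swaps it wholesale.
class FunctionalWrapper {
 public:
  explicit FunctionalWrapper(Values v)
      : value_(std::make_shared<const Values>(std::move(v))) {}

  const Values& value() const { return *value_; }

  // Analogue of replace_: point the wrapper at a freshly computed value.
  void replace_(Values v) {
    value_ = std::make_shared<const Values>(std::move(v));
  }

  // Analogue of a.add_(b), expressed as a.replace_(a.add(b)).
  FunctionalWrapper& add_(const FunctionalWrapper& other) {
    replace_(add(value(), other.value()));
    return *this;
  }

  // Keeps the current value alive so it can be inspected later.
  std::shared_ptr<const Values> snapshot() const { return value_; }

 private:
  std::shared_ptr<const Values> value_;
};

// Usage sketch: the "mutation" produces a new value; the old one is untouched.
inline bool demo_mutation_removal() {
  FunctionalWrapper a(Values{1.0, 2.0});
  FunctionalWrapper b(Values{10.0, 20.0});
  const auto before = a.snapshot();
  a.add_(b);
  return a.value() == Values{11.0, 22.0} && *before == Values{1.0, 2.0};
}
```

The caller still sees in-place semantics on the wrapper, while the underlying computation only ever runs the out-of-place `add`.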

**Main changes from the original PR**

(1)  I use lambdas instead of a giant enum to handle all of the different views.

This results in less boilerplate per view op (and more stuff that can be codegen'd). Every `ViewMeta` object now contains a `forward` and `reverse` lambda, that knows how to replay the view and its inverse. This makes the actual code that executes the replaying logic a lot less boilerplate-y (see `Alias::sync_update_operations` and `FunctionalTensorWrapper::sync_`)
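
A pair of `forward`/`reverse` lambdas of this kind can be sketched for a contiguous slice over a plain vector. The types below (`ToyViewMeta`, `make_slice_meta`, `demo_view_replay`) are hypothetical simplifications; the real `ViewMeta` lambdas operate on `at::Tensor` and also take a `mutated_view_idx`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Values = std::vector<double>;

// Hypothetical, simplified analogue of a ViewMeta: one lambda replays the
// view from a base, the other writes a mutated view back into the base.
struct ToyViewMeta {
  std::function<Values(const Values& base)> forward;
  std::function<Values(const Values& base, const Values& mutated_view)> reverse;
};

// Build the pair of lambdas for a contiguous slice [start, start + length).
inline ToyViewMeta make_slice_meta(std::size_t start, std::size_t length) {
  return ToyViewMeta{
      [start, length](const Values& base) {
        assert(start + length <= base.size());
        return Values(base.begin() + start, base.begin() + start + length);
      },
      [start, length](const Values& base, const Values& mutated_view) {
        assert(mutated_view.size() == length);
        assert(start + length <= base.size());
        Values updated = base;  // positions outside the slice keep base's values
        for (std::size_t i = 0; i < length; ++i) {
          updated[start + i] = mutated_view[i];
        }
        return updated;
      }};
}

// Usage sketch: mutate a view, then replay the mutation onto the base.
inline bool demo_view_replay() {
  const Values base{0.0, 1.0, 2.0, 3.0};
  const ToyViewMeta meta = make_slice_meta(1, 2);
  Values view = meta.forward(base);  // elements at positions 1 and 2
  for (double& x : view) x *= 10.0;  // mutate the view
  const Values new_base = meta.reverse(base, view);
  return new_base == Values{0.0, 10.0, 20.0, 3.0};
}
```

Since each view captures its own arguments in the closures, the replay code only needs to call `forward` or `reverse` and does not have to switch on the kind of view.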

(2) Every tensor during the functionalization pass is always wrapped in a `FunctionalTensorWrapper`.

This is potentially unnecessary for Vulkan/XLA, and will have a mild perf impact, but for now this PR just targets the functorch use case. I previously had a complicated design (a `FunctionalTensorImplBase` class) to avoid needing the wrapper for XLA, but it had some subtleties that are gonna require more thought to fix, so I'm pushing that off for now.

(3) `FunctionalTensorWrapper` objects accurately report stride information.

It's a little annoying to do this though, because the logic that calculates stride info for each view isn't easily separated from the actual view kernels in core, `at::native::{view}`. I do this by adding logic in each `at::functionalization::{view}` kernel to call the reference implementation `at::native::{view}`. I don't do anything with the output aside from taking its size/stride/storage_offset to set the actual output tensor's size/stride/storage_offset correctly. There's another annoying part to this: I'm pretty sure that we want to pass in the actual *wrapper* tensors directly into the native kernels, not their inner unwrapped values. But there are some `at::native::{view}` kernels that call other tensor methods, which re-invokes the dispatcher, calling functionalization/functorch kernels that try to do the unwrapping.

To do this, right now I have an `AutoDispatchDirectlyToNative` guard that basically ensures that any tensor methods called inside of the at::native::{view} op always redispatch straight to the CPU kernel (which will be another at::native:: kernel). This feels kind of heavy handed, but I'm not sure of a better way to do it.

(4) `FunctionalTensorWrapper` objects accurately report aliasing information.

There's a new `FunctionalStorageImpl` class (subclass of `StorageImpl`) that allows tensors in the functionalization pass to accurately alias storage. If two tensors `a` and `b` in a functionalized program are views of one another, then `a.storage.is_alias_of(b.storage)` should return true. I added this in a pretty similar way to how meta tensors allocate storage, although I don't pass in an actual allocator (I think this is fine because you should never resize a functional tensor's storage).

One thing I'm not sure about - should `FunctionalTensorWrapper` set `storage_access_should_throw_`: (a) always, (b) never, (c) only if its wrapped tensor has it set.

Right now I have it not set, mostly because calling the reference view functions (`at::native::{view}`) requires looking at the storage. But that means that if you try to access storage from python in a functionalized program, you'll get silent garbage instead of an error. Related question: are we planning on exposing meta tensor storage to python in the future (even though it contains garbage)?

(5) better docs :)

**View operator coverage**

(6) The functionalization pass now gets math-composite view ops for free.

I didn't add the `Functionalize` dispatch key to the composite set, because I don't want composite ops like `torch.ones` to get decomposed before hitting the functionalization pass. Instead, I added codegen to manually register the `at::native::` kernels of composite view ops. This is a little hairy, because the names of the `at::native::` kernels aren't easily accessible. They're stored in a `Dict[DispatchKey, BackendIndex]`. I made a best-effort attempt to get each view kernel's name, basically by assuming that every view op has either a composite or cpu implementation.
There's also a hardcoded list of composite view ops in `gen_inplace_or_view_type.py`, but it looks like it's wrong. This is probably worth rationalizing later, but instead I created a new list of the "complete" set of composite view ops, and preserved the old set by hardcoding the delta between the two sets.

(7) I've added codegen for ops that are both views AND mutations, like `transpose_()` (why do we even have these 😢).

From some light testing, it looks like they work correctly with one caveat: I had a hard time ensuring that functorch programs that mutate their inputs using ops like `transpose_()` preserve the input mutations after the program finishes running. For now (in my corresponding functorch branch) I emit a warning when this happens, and just don't preserve the mutation.

(8) I added `{view}_inverse` implementations for every view op, in `FunctionalInverses.cpp`.

These are needed to take mutations made to views and replay them back onto the base. To reduce boilerplate, the codegen generates function declarations for each `{view}_inverse` function, so you get a nice compiler error when someone eventually adds a new view op.

The only view ops currently not supported are (a) as_strided, and (b) the sparse view ops (values()/indices()).

I can add support for as_strided, but it needs an `as_strided_inverse()` function. That will look really similar to the `as_strided_backward()` function in FunctionsManual.cpp, but it has some noticeable differences: we basically want an `as_strided_embed` for autograd and `as_strided_scatter` for functionalization. We also will probably need them to be primitives w.r.t. autograd, since the current implementation for autograd uses view().copy_() calls that XLA won't be able to handle. I'm wondering if anyone has any objections, but otherwise I can make those changes (which will require writing backward formulas for `as_strided_embed` and `as_strided_scatter`).

I did a bunch of manual testing that all looks pretty good, but it's definitely not fully tested. Ed pointed out that once XLA uses this pass (or at least once there's a POC), we can just run the existing xla view test suite. Hopefully that delay is okay - if it's not, maybe we can think about using OpInfos similar to how functorch uses them for testing.

Note: there's some duplication with autograd's view code. Every `{view}_inverse` implementation is really similar to the implementation for that view listed in `derivatives.yaml`. There are some major differences though:
* the autograd implementations of those backward functions (like `permute_backwards()`, in `FunctionsManual.cpp`) internally call other view ops. For functionalization, we want them to (eventually) call `{view}_copy` operators.
* For view ops that take a subset of the original storage, like `slice/select/diagonal/as_strided()`, the autograd backward functions fill the "spaces" in the inverse call with zeroes. For functionalization, we want to fill them with the value of `base` at those positions. It looks like this currently applies to 6 total ops (since we can ignore composites):
  * select
  * slice
  * diagonal
  * as_strided
  * split
  * split_with_sizes
A nice end state would probably be for the autograd + functionalization codegen to both look at the same yaml (either `derivatives.yaml`, or something else), and automatically generate the right thing. I didn't leave that in scope for this PR though.
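
The difference in fill behaviour for subset views can be sketched for a one-dimensional slice. Both functions below are hypothetical illustrations on plain vectors (the names `slice_backward_sketch` and `slice_inverse_sketch` are made up here) and are not the actual autograd or `FunctionalInverses.cpp` code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Values = std::vector<double>;

// Autograd-style: scatter `grad` into a zero-filled buffer shaped like base.
inline Values slice_backward_sketch(
    std::size_t base_size, const Values& grad, std::size_t start) {
  assert(start + grad.size() <= base_size);
  Values out(base_size, 0.0);  // positions outside the slice become zero
  for (std::size_t i = 0; i < grad.size(); ++i) out[start + i] = grad[i];
  return out;
}

// Functionalization-style: scatter the mutated view into a copy of base.
inline Values slice_inverse_sketch(
    const Values& base, const Values& mutated_view, std::size_t start) {
  assert(start + mutated_view.size() <= base.size());
  Values out = base;  // positions outside the slice keep base's values
  for (std::size_t i = 0; i < mutated_view.size(); ++i) {
    out[start + i] = mutated_view[i];
  }
  return out;
}

// Usage sketch: same view and offset, different fill for untouched positions.
inline bool demo_fill_difference() {
  const Values base{5.0, 6.0, 7.0, 8.0};
  const Values view{60.0, 70.0};
  return slice_backward_sketch(base.size(), view, 1) ==
      Values{0.0, 60.0, 70.0, 0.0} &&
      slice_inverse_sketch(base, view, 1) == Values{5.0, 60.0, 70.0, 8.0};
}
```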

**Current State + Next Steps**

There are a bunch of followups after this PR eventually lands. Roughly in order:
* Use the current pass to register problematic composite ops in functorch. Also, nested `functionalize()` calls aren't supported yet (I mostly just need to remove some debug asserts and test it).
* Work on freeing up dispatch key space by deduplicating the `{backend}`/`Autograd{backend}`/`Sparse{backend}`/`Quantized{backend}` keys
* Once we have more dispatch keys, split up this pass into 3 pieces - it's currently fused, and doesn't do the right thing for vulkan/XLA. Specifically, all of the `{view}` calls in the current pass's view-replay logic should turn into `{view}_copy` calls that vulkan/XLA know how to implement, and there will be separate passes for (a) removing mutations, and (b) turning `{view}_copy` calls back into `{view}` calls. For Vulkan, we eventually want a pass that ONLY removes aliasing and view calls, and doesn't remove mutations. We can also probably make the 2 new passes user dispatch keys to save dispatch key space, if they'll only be used by functorch anyway.
* Do more of a dive on perf for the vulkan/xla use cases. There are several areas to improve perf with varying levels of effort required. The simplest one that I'll probably do regardless is to codegen the out-of-place kernels instead of using a boxed fallback. Getting a POC working for xla will also be useful to test the view operator coverage.

**Example Codegen Output**

View Op:
```
::std::vector<at::Tensor> split_Tensor(c10::DispatchKeySet ks, const at::Tensor & self, int64_t split_size, int64_t dim) {

      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      ::std::vector<at::Tensor> out;
      {
        at::AutoDispatchBelowFunctionalize guard;
        auto tmp_output = at::redispatch::split(ks & c10::after_func_keyset, self_, split_size, dim);
        out = at::functionalization::impl::wrapFunctionalTensor(tmp_output);
        // I'm fusing the [alias removal], [mutation removal], [add views back] passes together.
        // Later, we'll want to turn them into separate passes (since e.g. vulkan only cares about alias removal).
      }

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [split_size, dim](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.split(split_size, dim)[mutated_view_idx];
        },
        [split_size, dim](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::split_inverse(base, mutated_view, mutated_view_idx, split_size, dim);
        }
      );
      at::functionalization::impl::set_view_meta(out, self, view_meta);

      at::AutoDispatchDirectlyToNative native_guard;
      ::std::vector<at::Tensor> reference_tensor_output = at::native::split(self, split_size, dim);
      at::functionalization::impl::set_strides(out, reference_tensor_output);
      return out;

}
```

Mutation Op:
```
at::Tensor & add__Tensor(c10::DispatchKeySet ks, at::Tensor & self, const at::Tensor & other, const at::Scalar & alpha) {

      at::functionalization::impl::sync(self);
      at::functionalization::impl::sync(other);
      auto self_ = at::functionalization::impl::unwrapFunctionalTensor(self);
      auto other_ = at::functionalization::impl::unwrapFunctionalTensor(other);
      at::Tensor tmp_output;
      {
          at::AutoDispatchBelowFunctionalize guard;
          // The functionalization pass explicitly doesn't pass out= parameters to the redispatch
          tmp_output = at::redispatch::add(
            ks & c10::after_func_keyset, self_, other_, alpha);
      }

      self.replace_(tmp_output);
      at::functionalization::impl::maybe_add_update(self);
      return self;
}
```

View + Mutation Op:
```
at::Tensor & transpose_(c10::DispatchKeySet ks, at::Tensor & self, int64_t dim0, int64_t dim1) {

      at::functionalization::ViewMeta view_meta = at::functionalization::ViewMeta(
        [dim0, dim1](const at::Tensor& base, int64_t mutated_view_idx) -> at::Tensor {
          return base.transpose(dim0, dim1);
        },
        [dim0, dim1](const at::Tensor& base, const at::Tensor& mutated_view, int64_t mutated_view_idx) -> at::Tensor {
          return at::functionalization::impl::transpose_inverse(base, mutated_view, dim0, dim1);
        }
      );
      at::functionalization::impl::mutate_view_meta(self, view_meta);
      // See  Note [Propagating strides in the functionalization pass]
      // Directly update the sizes/strides/storage_offset fields on self using the inplace call.
      // I need the guard because I don't want the at::native kernel to end up calling more functionalization/functorch kernels.
      // Its only job is to directly compute the output size/stride/storage_offset metadata.
      at::AutoDispatchDirectlyToNative native_guard;
      at::native::transpose_(self, dim0, dim1);
      return self;

}
```

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D31942093

Pulled By: bdhirsh

fbshipit-source-id: b95598dae35dd1842fa8b1d8d1448332f3afaadf
2021-10-28 10:51:17 -07:00
Richard Barnes
9900310133 Fix sign warnings in CUDA kernels (#66753)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66753

Fixes these Wextra compilation errors:
```
stderr: caffe2/aten/src/ATen/native/cuda/UnarySignKernels.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/UnarySignKernels.cu:49:72: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   49 |   AT_DISPATCH_ALL_TYPES_AND2(kBFloat16, ScalarType::Half, iter.input_dtype(), "signbit_cuda", [&]() {
      |                                                                      ~~^~~
stderr: caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:86: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                      ^
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:97: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                                 ^
stderr: caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu: In lambda function:
caffe2/aten/src/ATen/native/cuda/BinaryMulDivKernel.cu:99:86: error: comparison is always false due to limited range of data type [-Werror=type-limits]
   99 |     AT_DISPATCH_INTEGRAL_TYPES(dtype, "div_floor_cuda", [&]() {
      |                                                                                      ^
```
And also these warnings:
```
caffe2/c10/util/Half.h(461): warning: pointless comparison of unsigned integer with zero
          detected during instantiation of "std::enable_if<<expression>, __nv_bool>::type c10::overflows<To,From>(From) [with To=size_t, From=unsigned long]"
caffe2/aten/src/ATen/native/Resize.h(45): here
caffe2/c10/util/Half.h(459): warning: pointless comparison of unsigned integer with zero
          detected during instantiation of "std::enable_if<<expression>, __nv_bool>::type c10::overflows<To,From>(From) [with To=size_t, From=unsigned long]"
caffe2/aten/src/ATen/native/Resize.h(45): here
```
I thought I'd fixed this previously using `std::is_unsigned` in D25256251 (cff1ff7fb6), but apparently that was insufficient.
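
For illustration only: a runtime test on `std::is_unsigned` still instantiates the `< 0` comparison for unsigned types, so the diagnostic can remain. One way to avoid that, sketched below with a hypothetical helper `is_negative_sketch` (not the actual c10 code), is to select an overload by signedness so the unsigned version never contains the comparison.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper, not the c10 implementation.
// Unsigned overload: no comparison against zero is ever emitted.
template <typename T>
constexpr typename std::enable_if<std::is_unsigned<T>::value, bool>::type
is_negative_sketch(const T& /*x*/) {
  return false;
}

// Signed and floating-point overload: the comparison is meaningful here.
template <typename T>
constexpr typename std::enable_if<!std::is_unsigned<T>::value, bool>::type
is_negative_sketch(const T& x) {
  return x < T(0);
}
```

An overflow check can then call `is_negative_sketch(f)` instead of writing `f < 0` inline.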

Test Plan: Sandcastle

Reviewed By: malfet, ngimel

Differential Revision: D31708173

fbshipit-source-id: 7714f6bbf109d2f2164630d3fc46bad18046c06c
2021-10-27 13:39:27 -07:00
Horace He
4fe8055b9f made functorch not decompose by default (#66945)
Summary:
Basically reverting this: https://github.com/pytorch/pytorch/pull/63616

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66945

Reviewed By: zou3519

Differential Revision: D31802176

Pulled By: Chillee

fbshipit-source-id: b1cabd7af66aef26411801516c87336eaea4fccb
2021-10-21 19:18:00 -07:00
andrewor
e046386be8 Avoid inlining error reporting in checked_convert (#66721)
Summary:
**Summary:** Move the error reporting part to the cpp file to avoid callers inlining it, which inflates the generated code size. See https://github.com/pytorch/pytorch/issues/65830.
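
A minimal sketch of the pattern, assuming hypothetical names (`checked_convert_sketch`, `report_overflow_sketch`) rather than the real c10 signatures: the template keeps only the range check, and the string building and throw sit behind a plain function call. In a real build the definition would live in a `.cpp` file; it is kept in one file here so the example is self-contained.

```cpp
#include <cassert>
#include <limits>
#include <stdexcept>
#include <string>

// Declared here, defined out of line (in a .cpp file in a real project),
// so callers only pay for a function call on the error path.
[[noreturn]] void report_overflow_sketch(const char* name);

// Assumes From can represent every value of To (e.g. double -> float).
template <typename To, typename From>
To checked_convert_sketch(From f, const char* name) {
  if (f < static_cast<From>(std::numeric_limits<To>::lowest()) ||
      f > static_cast<From>(std::numeric_limits<To>::max())) {
    report_overflow_sketch(name);
  }
  return static_cast<To>(f);
}

// The expensive part: message construction and exception setup.
[[noreturn]] void report_overflow_sketch(const char* name) {
  throw std::runtime_error(
      std::string("value cannot be converted to type ") + name +
      " without overflow");
}
```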

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66721

Test Plan:
Compiling the simple program below now generates ~150 lines of assembly, compared to 700+ lines before.

```
#include <c10/core/Scalar.h>

void g(float) {}

void f(const c10::Scalar& scalar) {
    auto x = scalar.to<float>();
    g(x);
}
```

**Reviewers:** Brian Hirsh

**Subscribers:** Brian Hirsh, Edward Yang, Yining Lu

**Tasks:** T103384490

**Tags:** pytorch

Fixes https://github.com/pytorch/pytorch/issues/65830

Reviewed By: zou3519, bdhirsh

Differential Revision: D31737607

Pulled By: andrewor14

fbshipit-source-id: 3d493c4d8e51d8f8a19d00f59b8ea28176c8a9e3
2021-10-20 16:04:09 -07:00
Giuseppe Ottaviano
72803dbcfd [caffe2] Fix invalid vector accesses and polar() call (#66757)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66757

`InterpreterStateImpl::run()` gets the number of outputs from the current frame, but by the time the continuation completes, the frame is gone, so we're calling `front()` on an empty vector. This works out in practice (data is still there) but it is technically undefined behavior and could break in the future.
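
The commit text does not show the fix itself. One common way to avoid this kind of access, sketched here with hypothetical types (`FrameSketch`, `make_continuation_sketch`) that are not the JIT interpreter's, is to read the value while the frame is still alive and capture it by value in the continuation.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for an interpreter frame.
struct FrameSketch {
  std::size_t num_outputs;
};

// Builds a continuation that stays valid after the frame stack is emptied.
std::function<std::size_t()> make_continuation_sketch(
    const std::vector<FrameSketch>& frames) {
  // Read the count now, while frames.front() is still valid, and capture the
  // copy instead of calling front() later on a possibly empty vector.
  const std::size_t num_outputs = frames.front().num_outputs;
  return [num_outputs]() { return num_outputs; };
}
```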

Also, `std::polar()` expects its argument to be non-negative, but `c10::polar()` does not, so implement it explicitly (implementation is the same as libstdc++).
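
As a rough sketch of what an explicit implementation can look like (the name `polar_sketch` is hypothetical, and this is not copied from c10): the result is built directly from r·cos θ and r·sin θ, with no precondition on the sign of the magnitude.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical helper: accepts a negative magnitude, unlike std::polar,
// whose behavior is undefined in that case.
template <typename T>
std::complex<T> polar_sketch(const T& r, const T& theta = T()) {
  return std::complex<T>(r * std::cos(theta), r * std::sin(theta));
}
```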

Test Plan: JIT tests pass.

Reviewed By: zhxchen17

Differential Revision: D31715587

fbshipit-source-id: 98abcc10c2742887af866d8e70169a0187c41d33
2021-10-19 00:29:54 -07:00
Scott Wolchok
44fd312604 [PyTorch] Use intrusive_ptr to save space in KernelFunction (#65618)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65618

This saves 8 bytes per KernelFunction, which should help in resource-constrained environments.
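
To illustrate where such a saving can come from: on common standard library implementations a `std::shared_ptr` holds two pointers (object and control block), while an intrusive pointer keeps the count inside the object and is one pointer wide. The sketch below uses hypothetical types, is not `c10::intrusive_ptr`, and is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// The reference count lives inside the object itself.
struct RefCountedSketch {
  std::size_t refcount = 0;
};

template <typename T>
class IntrusivePtrSketch {
 public:
  explicit IntrusivePtrSketch(T* p = nullptr) : p_(p) { retain(); }
  IntrusivePtrSketch(const IntrusivePtrSketch& other) : p_(other.p_) { retain(); }
  // Copy-and-swap: the by-value parameter releases the old pointee.
  IntrusivePtrSketch& operator=(IntrusivePtrSketch other) {
    std::swap(p_, other.p_);
    return *this;
  }
  ~IntrusivePtrSketch() {
    if (p_ != nullptr && --p_->refcount == 0) {
      delete p_;
    }
  }
  T* get() const { return p_; }

 private:
  void retain() {
    if (p_ != nullptr) {
      ++p_->refcount;
    }
  }
  T* p_;
};

// The handle is exactly one pointer wide.
static_assert(sizeof(IntrusivePtrSketch<RefCountedSketch>) == sizeof(void*),
              "intrusive handle should be one pointer wide");
```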
ghstack-source-id: 140731069

Test Plan: CI

Reviewed By: ezyang

Differential Revision: D25405736

fbshipit-source-id: 757c0f1387da9147e46ac69af2aa9fffd2998e35
2021-10-18 12:53:45 -07:00
Mengwei Liu
53aac4b6f3 [PyTorch] Allow override for macro HAS_DEMANGLE (#66540)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66540

Currently the macro `HAS_DEMANGLE` is determined by compiler predefined macros. Here I'm adding an option to allow `HAS_DEMANGLE` to be defined in build files.
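
A sketch of the override pattern, with an illustrative detection list and a hypothetical function name (`demangle_sketch`); the real conditions in c10 may differ. Wrapping the detection in `#ifndef` lets a build file pass, for example, `-DHAS_DEMANGLE=0` and skip the compiler-based default.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Honor a value supplied by the build; otherwise fall back to
// compiler-predefined macros. The list below is an assumption.
#ifndef HAS_DEMANGLE
#if defined(__GNUC__) && !defined(__ANDROID__) && !defined(__EMSCRIPTEN__)
#define HAS_DEMANGLE 1
#else
#define HAS_DEMANGLE 0
#endif
#endif

#if HAS_DEMANGLE
#include <cxxabi.h>
#endif

// Returns the demangled name when possible, else the input unchanged.
std::string demangle_sketch(const char* name) {
#if HAS_DEMANGLE
  int status = -1;
  char* out = abi::__cxa_demangle(name, nullptr, nullptr, &status);
  if (status == 0 && out != nullptr) {
    std::string result(out);
    std::free(out);
    return result;
  }
  std::free(out);  // free(nullptr) is a no-op
#endif
  return name;
}
```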

Test Plan: Rely on CI

Reviewed By: poweic

Differential Revision: D31600007

fbshipit-source-id: 76cf088b0f5ee940e977d3b213f1446ea64be036
2021-10-17 16:10:45 -07:00
Xue Li
2f099c7555 Revert D30652629: use irange for loops
Test Plan: revert-hammer

Differential Revision:
D30652629 (687c2267d4)

Original commit changeset: 0ae6c4bbbb55

fbshipit-source-id: 5c4f067b584a021c8c9656454d1ee60999600fb3
2021-10-15 15:23:10 -07:00
Richard Barnes
687c2267d4 use irange for loops (#66234)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66234

Modified loops in files under fbsource/fbcode/caffe2/ from the format

`for(TYPE var=x0;var<x_max;var++)`

to the format

`for(const auto var: irange(x_max))`

This was achieved by running r-barnes's loop upgrader script (D28874212) with some modification to exclude all files under /torch/jit and a number of reversions or unused variable suppression warnings added by hand.
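
For readers unfamiliar with the pattern, here is a self-contained sketch of the rewrite. `IRangeSketch` is a minimal stand-in written for this example; the codebase itself uses `c10::irange`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-in for an integer range over [0, end).
class IRangeSketch {
 public:
  class Iterator {
   public:
    explicit Iterator(std::size_t v) : v_(v) {}
    std::size_t operator*() const { return v_; }
    Iterator& operator++() {
      ++v_;
      return *this;
    }
    bool operator!=(const Iterator& other) const { return v_ != other.v_; }

   private:
    std::size_t v_;
  };

  explicit IRangeSketch(std::size_t end) : end_(end) {}
  Iterator begin() const { return Iterator(0); }
  Iterator end() const { return Iterator(end_); }

 private:
  std::size_t end_;
};

std::size_t sum_indexed_sketch(const std::vector<std::size_t>& v) {
  std::size_t total = 0;
  // Before: for (std::size_t i = 0; i < v.size(); i++)
  for (const auto i : IRangeSketch(v.size())) {
    total += v[i];
  }
  return total;
}
```

The index is `const` and its type follows the bound, which removes a class of signed/unsigned mismatches in hand-written loops.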

bypass_size_limit
allow-large-files

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D30652629

fbshipit-source-id: 0ae6c4bbbb554bad42e372792a6430e1acf15e3e
2021-10-15 13:50:33 -07:00
Richard Barnes
bd25f92e81 Fix Wextra issues in Half.h (#66643)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66643

Fixes:
```
caffe2/c10/util/Half.h:456:14: error: comparison of integers of different signs: 'long' and 'unsigned long' [-Werror,-Wsign-compare]
    return f > limit::max() ||
           ~ ^ ~~~~~~~~~~~~
```
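
The diff itself is not shown here. As a sketch of one way to compare integers of different signedness without `-Wsign-compare` (a hypothetical helper in the spirit of C++20 `std::cmp_greater`, for integer types only, not the actual Half.h change), overloads can be selected on the signedness of each side:

```cpp
#include <cassert>
#include <type_traits>

// Same signedness: a plain comparison is already safe.
template <typename A, typename B>
constexpr typename std::enable_if<
    std::is_signed<A>::value == std::is_signed<B>::value, bool>::type
greater_than_sketch(A a, B b) {
  return a > b;
}

// Signed a, unsigned b: a negative a is never greater.
template <typename A, typename B>
constexpr typename std::enable_if<
    std::is_signed<A>::value && !std::is_signed<B>::value, bool>::type
greater_than_sketch(A a, B b) {
  return a >= 0 && static_cast<typename std::make_unsigned<A>::type>(a) > b;
}

// Unsigned a, signed b: any a is greater than a negative b.
template <typename A, typename B>
constexpr typename std::enable_if<
    !std::is_signed<A>::value && std::is_signed<B>::value, bool>::type
greater_than_sketch(A a, B b) {
  return b < 0 || a > static_cast<typename std::make_unsigned<B>::type>(b);
}
```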

Test Plan: Sandcastle

Reviewed By: ngimel

Differential Revision: D31656816

fbshipit-source-id: 7623d20e166a9e95a949ebd8b23793f24960cf07
2021-10-15 13:38:10 -07:00
Peter Bell
5f45927d15 Autograd: Delay warnings until the end of backward execution (#66235)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/50209

This adds a new warning handler that stores all warnings in a shared
queue, which can be "replayed" at a later time and, crucially, on
another thread. Then, I use this inside the autograd engine to ensure
that warnings are processed by the handler registered on the main
thread.
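
A minimal sketch of the queue-and-replay idea, using a hypothetical class (`DelayedWarningsSketch`) that is not the c10 warning handler API: worker threads only record messages, and the owning thread later drains them into its own handler.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch, not the c10 WarningHandler interface.
class DelayedWarningsSketch {
 public:
  // May be called from any thread: only records the message.
  void warn(std::string msg) {
    std::lock_guard<std::mutex> lock(mutex_);
    pending_.push_back(std::move(msg));
  }

  // Called later on the thread whose handler should see the warnings.
  void replay(const std::function<void(const std::string&)>& handler) {
    std::vector<std::string> drained;
    {
      // Swap under the lock, then call the handler without holding it.
      std::lock_guard<std::mutex> lock(mutex_);
      drained.swap(pending_);
    }
    for (const auto& msg : drained) {
      handler(msg);
    }
  }

 private:
  std::mutex mutex_;
  std::vector<std::string> pending_;
};
```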

For testing, I also add an operator that always warns in the backward
pass and test that the warning is a normal Python warning.

cc ezyang albanD zou3519 gqchen pearu nikitaved soulitzer Lezcano Varal7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66235

Reviewed By: ejguan

Differential Revision: D31505413

Pulled By: albanD

fbshipit-source-id: 1a7f60b038f55c20591c0748b9e86735b3fec2f9
2021-10-13 15:38:04 -07:00
Mengwei Liu
d8532e3524 [PyTorch] Split c10 Type.cpp into two files to allow targets to include one of them (#66445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66445

`Type.cpp` implements the `demangle()` function based on the macro `HAS_DEMANGLE`. This diff splits it into two `.cpp` files so that we can add either one to the build target. This change follows the pattern of `flags_use_no_gflags.cpp` and `flags_use_gflags.cpp`.

Test Plan: Rely on CI

Reviewed By: iseeyuan

Differential Revision: D31551432

fbshipit-source-id: f8b11783e513fa812228ec873459ad3043ff9147
2021-10-11 21:52:24 -07:00
Nikita Shulga
acb0157a3d Specialization for c10::util::get_type_index<std::string> (#66290)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66290

Add full specialization for std::string type index

It slightly speeds up compilation and resolves the ambiguity in how template instantiations implemented in inline namespaces are rendered when `__PRETTY_FUNCTION__` is computed.

It is unclear which `#pragma` controls this behaviour, but when code is compiled by clang-12+ using libstdc++, `__PRETTY_FUNCTION__` sometimes resolves `std::string` to `std::basic_string<char>` and sometimes to `std::__cxx11::basic_string<char>`, even though the symbol in the object file is always inside the `std::__cxx11::` namespace. This might break caffe2 serialization code that depends on dynamic hash generation.
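
A sketch of the idea with hypothetical names (`type_index_sketch`, `hash_sketch`), not the c10 implementation: the generic path hashes whatever spelling the compiler produces, while the full specialization for `std::string` hashes a fixed string and is therefore independent of inline-namespace rendering.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// FNV-1a, used only to keep the sketch self-contained.
inline std::uint64_t hash_sketch(const char* s) {
  std::uint64_t h = 14695981039346656037ULL;
  for (; *s != '\0'; ++s) {
    h ^= static_cast<unsigned char>(*s);
    h *= 1099511628211ULL;
  }
  return h;
}

// Generic path: the index depends on how the compiler spells T.
template <typename T>
std::uint64_t type_index_sketch() {
#if defined(_MSC_VER)
  return hash_sketch(__FUNCSIG__);
#else
  return hash_sketch(__PRETTY_FUNCTION__);
#endif
}

// Full specialization: a fixed spelling, identical across compilers.
template <>
inline std::uint64_t type_index_sketch<std::string>() {
  return hash_sketch("std::string");
}
```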

Template name resolution was debugged using https://gist.github.com/malfet/c83b9ebd35730ebf8bac7af42682ea37

(Note: this ignores all push blocking failures!)

Test Plan: CI

Reviewed By: r-barnes

Differential Revision: D31490050

fbshipit-source-id: 127091574cf6b92c7ec3f972821e4e76f5f626a9
2021-10-11 11:11:59 -07:00
Nikita Shulga
c373387709 Update CMake and use native CUDA language support (#62445)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62445

PyTorch currently uses the old style of compiling CUDA in CMake, which is just a
bunch of scripts in `FindCUDA.cmake`. Newer versions of CMake support CUDA natively as
a language, just like C++ or C.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D31503350

fbshipit-source-id: 2ee817edc9698531ae1b87eda3ad271ee459fd55
2021-10-11 09:05:48 -07:00
Luca Wehrstedt
bc06eefebe [reland] Allow external CUDA streams to be set as current (#66324)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66324

Fixes https://github.com/pytorch/pytorch/issues/65822.

Reland of https://github.com/pytorch/pytorch/pull/65914.
ghstack-source-id: 140105651

Test Plan: Added tests

Reviewed By: ngimel

Differential Revision: D31506134

fbshipit-source-id: ff56203a120befdb282e974309478ac11aa56652
2021-10-11 02:41:43 -07:00
Richard Barnes
109aa135e6 Remove apparently unnecessary std::remove_cv_t (#66254)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66254

`std::decay_t` already implies dropping the const
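
The claim can be checked at compile time. The assertions below are a small illustration, not code from the diff; note that only top-level cv-qualifiers are dropped.

```cpp
#include <cassert>
#include <type_traits>

// decay_t removes references and top-level const/volatile.
static_assert(std::is_same<std::decay_t<const int>, int>::value, "");
static_assert(std::is_same<std::decay_t<const volatile int&>, int>::value, "");

// So wrapping the result in remove_cv_t changes nothing.
static_assert(std::is_same<std::remove_cv_t<std::decay_t<const int&>>,
                           std::decay_t<const int&>>::value,
              "");

// Constness of a pointee is not top-level and is preserved.
static_assert(std::is_same<std::decay_t<const int*>, const int*>::value, "");
```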

Test Plan: Sandcastle

Reviewed By: malfet

Differential Revision: D31465856

fbshipit-source-id: 851cdb9194354fe9a89b3a37a4463a43dbbcd77a
2021-10-09 00:38:44 -07:00
Scott Wolchok
c80693f7e6 [jit] Add cache for MemoryDAG::collectAllContainedMemoryLocations (#65122)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65122

Failure to cache this seems to contribute to quadratic startup time for the static runtime.
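
To show why a cache helps here, a generic memoization sketch follows. The types are hypothetical stand-ins, not the `torch::jit` `MemoryDAG` classes, and the sketch assumes an acyclic graph that is not mutated after the first query (a real cache would need invalidation).

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Hypothetical stand-in for a memory-DAG element.
struct ElementSketch {
  int id;
  std::vector<ElementSketch*> contained;
  // Filled on the first query and reused afterwards.
  std::unique_ptr<std::set<int>> cached;
};

// Collects the ids of e and everything reachable from it.
// `computations` counts how often the expensive path runs.
const std::set<int>& collect_contained_sketch(ElementSketch& e,
                                              int& computations) {
  if (e.cached) {
    return *e.cached;  // cache hit: no traversal
  }
  ++computations;
  auto result = std::make_unique<std::set<int>>();
  result->insert(e.id);
  for (ElementSketch* child : e.contained) {
    const std::set<int>& sub = collect_contained_sketch(*child, computations);
    result->insert(sub.begin(), sub.end());
  }
  e.cached = std::move(result);
  return *e.cached;
}
```

Without the cache, every query re-walks shared sub-elements, which is how repeated queries over n elements can add up to quadratic work.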

Disclaimer: I am entirely un-versed in the performance considerations for the JIT and have no idea what the other impacts of this change may be. Let the reviewer beware.
ghstack-source-id: 140052522

Reviewed By: suo

Differential Revision: D30983268

fbshipit-source-id: 4329aee6b5781f5c2e2d2334c396fab8528d4b7b
2021-10-08 10:29:23 -07:00
Luca Wehrstedt
201174cb91 Revert D31389480: [pytorch][PR] Allow external CUDA streams to be set as current
Test Plan: revert-hammer

Differential Revision:
D31389480 (61f0bb70c1)

Original commit changeset: 2b2f40e5452c

fbshipit-source-id: c6631e51abcf3819732f981f646cb77b91569c7d
2021-10-08 09:20:24 -07:00
Scott Wolchok
1ae468a484 [jit] Refcounting spot fixes (#65346)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65346

Tidying up the top sources of reference count decrements seen during static runtime startup.
ghstack-source-id: 140027349

Test Plan:
CI

perf now shows under 2% time spent in ~__shared_count instead of about 5%.

Reviewed By: suo

Differential Revision: D31057277

fbshipit-source-id: 9a16daf2e655fda80d4ec21290b30f02ba63d8da
2021-10-08 08:39:20 -07:00
Luca Wehrstedt
61f0bb70c1 Allow external CUDA streams to be set as current (#65914)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/65822.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65914

Reviewed By: dagitses

Differential Revision: D31389480

Pulled By: lw

fbshipit-source-id: 2b2f40e5452c5b2a0b9f0f705750d2aa9deb2ead
2021-10-08 06:09:32 -07:00
CodemodService FBSourceClangFormatLinterBot
227f91e72d [AutoAccept][Codemod][FBSourceClangFormatLinter] Daily arc lint --take CLANGFORMAT
Reviewed By: zertosh

Differential Revision: D31495160

fbshipit-source-id: b0a56003a6695989dff0d325cdc118182662ec61
2021-10-07 21:09:22 -07:00
Will Constable
a8c0b362ce [pytorch][PR] Add hash and int128 utils for Lazy Tensor Core (#66181)
Summary:
These utils are prerequisites for Lazy Node base class.
- set up new torch/csrc/lazy, test/cpp/lazy dirs
- add source files to build_variables.bzl in new lazy_core_sources var
- create new test_lazy binary

Fixes https://github.com/pytorch/pytorch/issues/65636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66181

Original commit changeset: 3d0d5377d71e

Test Plan:
Run PyTorch XLA corresponding PR in XLA CI:
https://github.com/pytorch/xla/pull/3148/files

Reviewed By: suo

Differential Revision: D31416438

fbshipit-source-id: 58a6a49c5bc30134bc6bae2e42778f359b9a8f40
2021-10-07 10:05:26 -07:00
Shijun Kong
e2be087207 [oss][pytorch] Add quint2x4 dtype (#65545)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/65545

Introduce 2bit qtensor. The new dtype added for this is c10::quint2x4

The underlying storage for this is still uint8_t, so we pack 4 2-bit values in a byte while quantizing it.

Kernels that use this dtype should be aware of the packing format. (4 2-bit values in one byte)
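
A sketch of such a packing, with hypothetical helper names. The bit order is an assumption made for this example (element `i` occupies bits `2*(i%4)` and `2*(i%4)+1` of byte `i/4`); the actual layout used for `c10::quint2x4` is not stated in this message.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs values in [0, 3] four to a byte, lowest bits first (assumed order).
std::vector<std::uint8_t> pack_2bit_sketch(
    const std::vector<std::uint8_t>& values) {
  std::vector<std::uint8_t> packed((values.size() + 3) / 4, 0);
  for (std::size_t i = 0; i < values.size(); ++i) {
    const unsigned v = values[i] & 0x3u;  // keep only the low 2 bits
    packed[i / 4] =
        static_cast<std::uint8_t>(packed[i / 4] | (v << (2 * (i % 4))));
  }
  return packed;
}

// Reads back element i from the packed buffer.
std::uint8_t unpack_2bit_sketch(const std::vector<std::uint8_t>& packed,
                                std::size_t i) {
  return static_cast<std::uint8_t>((packed[i / 4] >> (2 * (i % 4))) & 0x3u);
}
```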

Test Plan: `buck test mode/dev-asan caffe2/test/:quantization -- test_qtensor`

Reviewed By: supriyar

Differential Revision: D31148141

fbshipit-source-id: 1dc1de719e097adaf93fee47c6d1b8010a3eae6c
2021-10-06 14:22:00 -07:00
Michael Suo
f062def486 Revert D31260343: [pytorch][PR] Add hash and int128 utils for Lazy Tensor Core
Test Plan: revert-hammer

Differential Revision:
D31260343 (e94fea08d0)

Original commit changeset: 8bb1194188e3

fbshipit-source-id: 3d0d5377d71ed928015bcb2105801be368e38cd8
2021-10-05 17:15:50 -07:00
Will Constable
e94fea08d0 Add hash and int128 utils for Lazy Tensor Core (#65635)
Summary:
These utils are prerequisites for Lazy Node base class.

- set up new torch/csrc/lazy, test/cpp/lazy dirs
- add source files to build_variables.bzl in new lazy_core_sources var
- create new test_lazy binary

Fixes https://github.com/pytorch/pytorch/issues/65636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/65635

Reviewed By: alanwaketan

Differential Revision: D31260343

Pulled By: wconstab

fbshipit-source-id: 8bb1194188e3e77fc42e08a14ba37faed37a9c2e
2021-10-05 16:43:55 -07:00
Dhruv Matani
e7747795c9 [PyTorch Edge] Reduce dispatch table size further for a trimmed build (#66112)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66112

Eliminate Metal and Vulkan Dispatch Keys.

Test Plan: Build + Sandcastle

Differential Revision: D31298307

fbshipit-source-id: 31302fc626382db7997e5058750fa85458c9cbc1
2021-10-05 15:24:07 -07:00