66 commits

Author SHA1 Message Date
Jiakai Liu
3dc0754c53 [pytorch][mobile] deprecate the LLVM-based static analyzer (#68180)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/68180

Since we've open sourced the tracing-based selective build, we can deprecate the
op-dependency-graph-based selective build and the static analyzer tool that
produces the dependency graph.
ghstack-source-id: 143108377

Test Plan: CIs

Reviewed By: seemethere

Differential Revision: D32358467

fbshipit-source-id: c61523706b85a49361416da2230ec1b035b8b99c
2021-11-11 16:37:08 -08:00
Chen Lai
355acfdebc [PyTorch Edge][tracing-based] use operator.yaml to build libtorch library (#66237)
Summary:
https://pxl.cl/1QK3N
Enable using the yaml file from tracer to build libtorch library for ios and android.

1. Android:
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1  ./scripts/build_pytorch_android.sh x86
```
libtorch_lite.so x86: 3 MB (larger than H1, static is ~3.2 MB)

2. iOS
```
SELECTED_OP_LIST=/Users/chenlai/Documents/pytorch/tracing/deeplabv3_scripted_tracing_update.yaml TRACING_BASED=1 BUILD_PYTORCH_MOBILE=1 IOS_PLATFORM=SIMULATOR  ./scripts/build_ios.sh
```
Binary size: 7.6 MB
Size:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/66237

ghstack-source-id: 140197164

Reviewed By: dhruvbird

Differential Revision: D31463119

fbshipit-source-id: c3f4eb71bdef1969eab6cb60999fec8547641cbd
2021-10-10 14:07:01 -07:00
Peter Bell
93e0f3a330 Shard Operators.cpp (#62185)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/62185

This file can take 5 minutes on its own to compile, and is the single limiting
factor for compile time of `libtorch_cpu` on a 32-core threadripper. Instead,
sharding into 5 files that take around 1 minute each cuts a full minute off the
overall build time.
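
A minimal sketch of the sharding idea, assuming operators are assigned to shards by a stable hash of their name. The helper names (`shard_of`, `shard_names`) and the use of CRC32 are illustrative assumptions, not necessarily what the PyTorch codegen does; the point is only that the assignment is deterministic, so each generated file keeps the same contents from build to build.

```python
import zlib
from typing import Dict, Iterable, List


def shard_of(name: str, num_shards: int) -> int:
    """Map a name to a shard index with a hash that is stable across runs."""
    # zlib.crc32 is used instead of hash() because str hashes are randomized per process.
    return zlib.crc32(name.encode("utf-8")) % num_shards


def shard_names(names: Iterable[str], num_shards: int) -> Dict[int, List[str]]:
    """Split names into num_shards buckets; every name lands in exactly one bucket."""
    shards: Dict[int, List[str]] = {i: [] for i in range(num_shards)}
    for name in names:
        shards[shard_of(name, num_shards)].append(name)
    return shards


# Hypothetical operator names, split into 5 buckets as in the commit message.
ops = [f"aten::op{i}" for i in range(20)]
buckets = shard_names(ops, 5)
for index, members in buckets.items():
    print(index, len(members))
```

With five shards of roughly equal size, the five translation units can compile in parallel instead of serializing behind one large file.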

This also factors out the `.findSchemaOrThrow(...).typed` step so the code can
be shared between `call` and `redispatch`.

Test Plan: Imported from OSS

Reviewed By: bdhirsh

Differential Revision: D29962049

Pulled By: albanD

fbshipit-source-id: be5df05fbea09ada0d825855f1618c25a11abbd8
2021-08-09 16:19:49 -07:00
Jiakai Liu
69b2bf70f9 [pytorch] fix tools/code_analyzer for llvm 11 (#60322)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/60322

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D29250420

Pulled By: ljk53

fbshipit-source-id: ff7f9cbacd1d9518ed81c06fc843a90d6948f760
2021-06-20 00:39:11 -07:00
Jiakai Liu
501320ed81 [pytorch] deprecate default_op_deps.yaml (#59573)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/59573

To do mobile selective build, we have several options:
1. static dispatch;
2. dynamic dispatch + static analysis (to create the dependency graph);
3. dynamic dispatch + tracing;

We are developing 3. For open source, we used to only support 1, and
currently we support both 1 and 2.

This file is only used for 2. It was introduced when we deprecated
the static dispatch (1). The motivation was to make sure we have a
low-friction selective build workflow for dynamic dispatch (2).
As the name indicates, it is the *default* dependency graph that users
can fall back on if they don't want to run the static analyzer themselves.
We have a CI to run the full workflow of 2 on every PR, which creates
the dependency graph on-the-fly instead of using the committed file.

The workflow to automatically update the file has been broken for a
while. It has started to confuse other PyTorch developers, as people
are already editing it by hand, and it might already be broken for some
models.

We reintroduced static dispatch recently, so we have decided to deprecate
this file now and automatically turn on static dispatch if users run a
selective build without providing the static analysis graph.

The tracing-based selective build will be the ultimate solution we'd
like to provide for OSS, but it will take some more effort to polish
and release.

Differential Revision:
D28941020

Test Plan: Imported from OSS

Reviewed By: dhruvbird

Pulled By: ljk53

fbshipit-source-id: 9977ab8568e2cc1bdcdecd3d22e29547ef63889e
2021-06-07 19:37:37 -07:00
Sam Estep
737d920b21 Strictly type everything in .github and tools (#59117)
Summary:
This PR greatly simplifies `mypy-strict.ini` by strictly typing everything in `.github` and `tools`, rather than picking and choosing only specific files in those two dirs. It also removes `warn_unused_ignores` from `mypy-strict.ini`, for reasons described in https://github.com/pytorch/pytorch/pull/56402#issuecomment-822743795: basically, that setting makes life more difficult depending on what libraries you have installed locally vs in CI (e.g. `ruamel`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/59117

Test Plan:
```
flake8
mypy --config mypy-strict.ini
```

Reviewed By: malfet

Differential Revision: D28765386

Pulled By: samestep

fbshipit-source-id: 3e744e301c7a464f8a2a2428fcdbad534e231f2e
2021-06-07 14:49:36 -07:00
Adnios
09a8f22bf9 Add mish activation function (#58648)
Summary:
See issue: https://github.com/pytorch/pytorch/issues/58375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/58648

Reviewed By: gchanan

Differential Revision: D28625390

Pulled By: jbschlosser

fbshipit-source-id: 23ea2eb7d5b3dc89c6809ff6581b90ee742149f4
2021-05-25 10:36:21 -07:00
Michael Carilli
bbc3cc6718 [CUDA graphs] [BC-breaking] Makes torch.cuda.amp.GradScaler scale updates in-place for better composability with graph capture (#55562)
Summary:
I'd like the following pattern (a natural composition of Amp with full fwd+bwd capture) to work:
```python
# Create "static_input" with dummy data, run warmup iterations,
# call optimizer.zero_grad(set_to_none=True), then
g = torch.cuda._Graph()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    optimizer.zero_grad(set_to_none=True)
    g.capture_begin()
    with autocast():
        out = model(static_input)
        loss = loss_fn(out)
    scaler.scale(loss).backward()
    g.capture_end()
torch.cuda.current_stream().wait_stream(s)

# Training loop:
for b in data:
    # optimizer.zero_grad() deliberately omitted, replay()'s baked-in backward will refill statically held .grads
    static_input.copy_(b)
    g.replay()
    scaler.step(optimizer)
    scaler.update()
```

Right now `GradScaler` can't work with this pattern because `update()` creates the scale tensor for the next iteration out of place. This PR changes `update()` to act in place on a long-lived scale tensor that stays static across iterations.
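
The difference between the two update styles can be illustrated without CUDA. The sketch below is a plain-Python analogy, not `GradScaler` code: `ScaleBuffer` and `capture` are hypothetical stand-ins for a one-element scale tensor and for graph capture, which records which buffer to read rather than the value it held at capture time.

```python
from typing import Callable


class ScaleBuffer:
    """Hypothetical stand-in for a one-element scale tensor (not a torch type)."""

    def __init__(self, value: float) -> None:
        self.value = value


def capture(scale_buf: ScaleBuffer) -> Callable[[float], float]:
    """Return a replay function that keeps a reference to scale_buf itself."""

    def replay(loss: float) -> float:
        # Reads whatever the captured buffer holds at replay time.
        return loss * scale_buf.value

    return replay


# Out-of-place update: a new buffer is created, so the captured one goes stale.
scale = ScaleBuffer(65536.0)
replay = capture(scale)
scale = ScaleBuffer(scale.value * 0.5)  # rebinds the name; replay still holds the old buffer
stale = replay(1.0)

# In-place update: the long-lived buffer is mutated, so replay sees the new scale.
scale2 = ScaleBuffer(65536.0)
replay2 = capture(scale2)
scale2.value *= 0.5
fresh = replay2(1.0)

print(stale, fresh)  # 65536.0 32768.0
```

The first replay keeps scaling by the old value, which is the failure mode described above; the second picks up the halved scale because the buffer it captured was updated in place.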

I'm not sure how this change affects XLA (see https://github.com/pytorch/pytorch/pull/48570), so we shouldn't merge without approval from ailzhang yaochengji.

Tagged bc-breaking because it's a change to the amp update utility function in native_functions.yaml. The function was never meant to be user-facing though.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/55562

Reviewed By: zou3519

Differential Revision: D28046159

Pulled By: ngimel

fbshipit-source-id: 02018c221609974546c562f691e20ab6ac611910
2021-04-30 13:03:05 -07:00
lezcano
fd02fc5d71 Port put_ and take from TH to ATen (#53356)
Summary:
The two ports were done together, as they can be implemented with the same kernel. In TH, they were already implemented with the same kernel.

Resolves https://github.com/pytorch/pytorch/issues/24751
Resolves https://github.com/pytorch/pytorch/issues/24614
Resolves https://github.com/pytorch/pytorch/issues/24640
Resolves https://github.com/pytorch/pytorch/issues/24772

This port makes sure that it interacts correctly with the "deterministic algorithms" flag, as done in https://github.com/pytorch/pytorch/pull/51388

This PR also makes these two functions correct in the following aspects (all of them added to the tests as well):
- Support for complex numbers
- Correct handling of scalar inputs and zero-dimensional inputs
- Implementation that does not do any copies nor sorting of any of the input tensors
- Faster and more correct implementation of the backwards (now it works as it should when `source.shape() != index.shape()`)
- Now `put_(..., accumulate=True)` is implemented correctly with atomic operations on GPU / CPU (when possible) and is deterministic (modulo the loss of precision that might happen due to the reordering of a sum of floats)
- Adds the missing `torch.put` function (`index_put` already exists, for example)
- Corrected docs
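
For readers unfamiliar with the two operations, here is a pure-Python model of their semantics on an already-flattened tensor. It is only a sketch of the behaviour described in the list above (linear indexing, optional accumulation), not the ATen kernel; the function names mirror `take` and `put_` for readability, but operate on plain lists.

```python
from typing import List, Sequence


def take(flat: Sequence[float], index: Sequence[int]) -> List[float]:
    """Read the elements of a flattened tensor at the given linear indices."""
    return [flat[i] for i in index]


def put_(
    flat: List[float],
    index: Sequence[int],
    source: Sequence[float],
    accumulate: bool = False,
) -> List[float]:
    """Write source into flat at the given linear indices, in place."""
    if len(index) != len(source):
        raise ValueError("index and source must hold the same number of elements")
    for i, value in zip(index, source):
        if accumulate:
            flat[i] += value  # repeated indices add up
        else:
            flat[i] = value  # repeated indices overwrite each other
    return flat


data = [0.0, 10.0, 20.0, 30.0]
taken = take(data, [3, 0, 3])
put_(data, [1, 1], [1.0, 2.0], accumulate=True)

print(taken)  # [30.0, 0.0, 30.0]
print(data)   # [0.0, 13.0, 20.0, 30.0]
```

With `accumulate=True`, both writes to index 1 are summed, which is why the GPU implementation needs atomic operations and why floating-point results can vary with the order of the additions.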

It also adds much more thorough testing of the operations and their gradients.

There is one BC-breaking change: we now check that the inputs do not overlap in the `put_` operation. The TH implementation handled this (in some cases; other cases were wrong) by making contiguous copies of the inputs. How should we handle this one?

**Edit.** Benchmarks:
<details>
<summary>Script</summary>

```python
from IPython import get_ipython
import torch
from itertools import product

torch.manual_seed(13)
torch.set_num_threads(1)

ipython = get_ipython()

cpu = torch.device('cpu')
cuda = torch.device('cuda')

def run_test(ndims, size, index_len, device, cmd):
    print(f"cmd: {cmd}, ndims: {ndims}, tensor_size: {size}, index_len: {index_len}, device: {device}")

    large_tensor = torch.rand(*([size] * ndims), device=device)
    small_tensor = torch.rand((index_len,), device=device)
    index = torch.randint(size * ndims, (index_len,), dtype=torch.long, device=device)
    if cmd == "put":
        command = "large_tensor.put_(index, small_tensor, accumulate=False)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "accumulate":
        command = "large_tensor.put_(index, small_tensor, accumulate=True)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    elif cmd == "take":
        command = "torch.take(large_tensor, index)"
        if device == cuda:
            command += "; torch.cuda.synchronize()"
    ipython.magic(f"timeit {command}")
    print()

for method, device in product(["accumulate", "put", "take"], [cpu, cuda]):
    run_test(3, 1000, 10, device, method)
    run_test(3, 1000, 1000, device, method)
    run_test(3, 1000, 10000, device, method)
    run_test(2, 10000, 100000, device, method)
```
</details>

```python
put_(accumulate=False)
```

<details>
<summary>ATen CPU (1.5x - 2x speedup)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.05 µs ± 2.35 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.15 µs ± 5.13 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
21.6 µs ± 13.1 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
238 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
722 ns ± 2.67 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
4.89 µs ± 18.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
42.5 µs ± 96.3 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
428 µs ± 774 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.99 µs ± 16 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 24.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.4 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.6 µs ± 1.12 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: put, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
8.44 µs ± 31.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.09 µs ± 4.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.77 µs ± 0.998 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: put, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
15.8 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

```python
put_(accumulate=True)
```

<details>
<summary>ATen CPU (x2 speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.12 µs ± 2.91 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
3.14 µs ± 2.05 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
20.8 µs ± 25.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
264 µs ± 263 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
814 ns ± 1.87 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
5.11 µs ± 6.02 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
43.9 µs ± 49.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
442 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
</details>
<details>
<summary>ATen GPU (3x - 11x speedup)</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.01 µs ± 14.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.4 µs ± 15.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.3 µs ± 44.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
12.6 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
34.7 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
38.2 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
61.2 µs ± 50.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

cmd: accumulate, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
140 µs ± 24.2 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

```python
take()
```

<details>
<summary>ATen CPU (1.1x speedup)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.18 µs ± 2.34 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.79 µs ± 2.96 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
16.6 µs ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
161 µs ± 984 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>

<details>
<summary>TH CPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cpu
1.1 µs ± 3.14 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cpu
2.93 µs ± 7.31 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cpu
18.6 µs ± 14.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cpu
178 µs ± 139 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
</details>
<details>
<summary>ATen GPU (same speed)</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.38 µs ± 23.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
10.7 µs ± 9.77 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
10.6 µs ± 107 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.5 µs ± 21.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

<details>
<summary>TH GPU</summary>

```python
cmd: take, ndims: 3, tensor_size: 1000, index_len: 10, device: cuda
9.31 µs ± 7.57 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 1000, device: cuda
9.52 µs ± 5.78 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 3, tensor_size: 1000, index_len: 10000, device: cuda
9.73 µs ± 17.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

cmd: take, ndims: 2, tensor_size: 10000, index_len: 100000, device: cuda
11.7 µs ± 5.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
```
</details>

cc mruberry

Pull Request resolved: https://github.com/pytorch/pytorch/pull/53356

Reviewed By: mruberry

Differential Revision: D27520243

Pulled By: ngimel

fbshipit-source-id: e3979349c2c62d2949e09fb05e5fd4883fbc9093
2021-04-05 18:05:38 -07:00
Jiakai Liu
b2d8f0a431 [pytorch][bot] update mobile op deps (#52110)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/52110

LLVM_DIR=/usr ANALYZE_TORCH=1 tools/code_analyzer/build.sh
cp build_code_analyzer/work/torch_result.yaml tools/code_analyzer/default_op_deps.yaml

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D26419138

Pulled By: ljk53

fbshipit-source-id: 26bf00036b19ad18a9cf06111df4d9fe32e5feab
2021-02-12 14:50:29 -08:00
nikitaved
c458558334 kill multinomial_alias_setup/draw (#50489)
Summary:
As per title. Partially fixes https://github.com/pytorch/pytorch/issues/49421.
These functions appear to be dead code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/50489

Reviewed By: mruberry

Differential Revision: D25948912

Pulled By: ngimel

fbshipit-source-id: 108723bd4c76cbc3535eba902d6f74597bfdfa58
2021-01-19 00:23:58 -08:00
Jiakai Liu
5252e9857a [pytorch] clean up unused util srcs under tools/autograd (#50611)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/50611

Removed the unused old-style code to prevent it from being used.
Added all autograd/gen_pyi sources to mypy-strict.ini config.

Confirmed byte-for-byte compatible with the old codegen:
```
Run it before and after this PR:
  .jenkins/pytorch/codegen-test.sh <baseline_output_dir>
  .jenkins/pytorch/codegen-test.sh <test_output_dir>

Then run diff to compare the generated files:
  diff -Naur <baseline_output_dir> <test_output_dir>
```

Confirmed clean mypy-strict run:
```
mypy --config mypy-strict.ini
```

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25929730

Pulled By: ljk53

fbshipit-source-id: 1fc94436fd4a6b9b368ee0736e99bfb3c01d38ef
2021-01-18 23:54:02 -08:00
Sebastian Messmer
4a14020c0d Remove .impl_UNBOXED() and functionalities associated with it (#49220)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49220

Since all ops are c10-full, we can remove .impl_UNBOXED now.
This also removes the ability of KernelFunction or CppFunction to store unboxedOnly kernels.
ghstack-source-id: 119450489

Test Plan: waitforsandcastle

Reviewed By: ezyang

Differential Revision: D25490225

fbshipit-source-id: 32de9d591e6a842fe18abc82541580647e9cfdad
2021-01-06 14:22:46 -08:00
Brian Hirsh
b5149513ec migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API, update code_analyzer regex (#48308)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48308

The original regex that I added didn't correctly match namespaces that started with an underscore (e.g. `_test`), which caused a master-only test to fail.

The only change from the previous commit is that I updated the regex like so:

before: `^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$`
after: `^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$`

I added in a `[_]*` to the beginning of the namespace capture. I did the same for the `_FRAGMENT` regex.
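
The effect of the change can be checked directly with Python's `re` module. The two patterns are copied from the lines above; the symbol names are hypothetical examples in the `TORCH_LIBRARY_IMPL_init_<namespace>_<dispatch key>_<counter>` shape that the patterns target.

```python
import re

# Patterns copied verbatim from the commit message.
BEFORE = re.compile(r"^.*TORCH_LIBRARY_IMPL_init_([^_]+)_([^_]+)_[0-9]+(\(.*)?$")
AFTER = re.compile(r"^.*TORCH_LIBRARY_IMPL_init_([_]*[^_]+)_([^_]+)_[0-9]+(\(.*)?$")

# Hypothetical symbol names, used only for illustration.
plain = "TORCH_LIBRARY_IMPL_init_aten_CPU_0"
underscored = "TORCH_LIBRARY_IMPL_init__test_CPU_2"

# A namespace without a leading underscore matches both patterns.
print(BEFORE.match(plain).group(1, 2))  # ('aten', 'CPU')
print(AFTER.match(plain).group(1, 2))   # ('aten', 'CPU')

# A namespace with a leading underscore only matches the updated pattern.
print(BEFORE.match(underscored))             # None
print(AFTER.match(underscored).group(1, 2))  # ('_test', 'CPU')
```

The old pattern fails because `[^_]+` must consume at least one non-underscore character immediately after `init_`, and the namespace `_test` starts with an underscore.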

Verified that running `ANALYZE_TEST=1 tools/code_analyzer/build.sh` (as the master-only test does) produces no diff in the output.

Fixing regex pattern to allow for underscores at the beginning of the
namespace

This reverts commit 3c936ecd3c.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25123295

Pulled By: bdhirsh

fbshipit-source-id: 54bd1e3f0c8e28145e736142ad62a18806bb9672
2020-11-30 13:05:33 -08:00
Brian Hirsh
3c936ecd3c Revert D25056091: migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API
Test Plan: revert-hammer

Differential Revision:
D25056091 (0ea4982cf3)

Original commit changeset: 0f647ab9bc5e

fbshipit-source-id: e54047b91d82df25460ee00482373c4580f94d50
2020-11-19 19:10:14 -08:00
Brian Hirsh
0ea4982cf3 migrate export_caffe2_op_to_c10.h macros to the new dispatcher registration API (#48097)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/48097

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D25056091

Pulled By: bdhirsh

fbshipit-source-id: 0f647ab9bc5e5aee497dac058df492f6e742cfe9
2020-11-19 17:56:56 -08:00
Jiakai Liu
4f538a2ba4 [pytorch][bot] update mobile op deps (#47825)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/47825

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D24913587

Pulled By: ljk53

fbshipit-source-id: b6219573c3238fb453d88019197a00c9f9dbabb8
2020-11-12 19:19:25 -08:00
Nikita Shulga
3d962430a9 Make gen_op_registration flake8 compliant (#47604)
Summary:
Fixes regression introduced by D24686838 (8182558c22)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47604

Reviewed By: walterddr

Differential Revision: D24832687

Pulled By: malfet

fbshipit-source-id: e9f7a35561c2b1705e11fd11abe402e3c83cf5cc
2020-11-09 08:31:07 -08:00
Martin Yuan
8182558c22 [PyTorch Mobile] Don't use __ROOT__ for inference only ops
Summary:
`__ROOT__` ops are only used in full-jit. To keep the binary small, disable them for inference. Since FL is still in full-jit, keep them for training only.

It saves -17 KB for fbios.

TODO: when FL is migrated to lite_trainer, remove `__ROOT__` to save size in training too.

Test Plan: CI

Reviewed By: dhruvbird

Differential Revision: D24686838

fbshipit-source-id: 15214cebb9d8defa3fdac3aa0d73884b352aa753
2020-11-08 15:27:47 -08:00
albanD
27e2ea4cea Make add_relu an internal function (#46676)
Summary:
Cleanup for 1.7

Pull Request resolved: https://github.com/pytorch/pytorch/pull/46676

Reviewed By: gchanan

Differential Revision: D24458565

Pulled By: albanD

fbshipit-source-id: b1e4b4630233d3f1a4bac20e3077411d1ae17f7b
2020-10-22 18:08:15 -07:00
Dhruv Matani
75322dbeb4 [PyTorch] [BUCK] Replace pt_deps.bzl with a YAML operator dependency file which is generated by the code analyser (#46057)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46057

The code analyser (that uses LLVM and runs in the OSS PyTorch git repo) already produces a YAML file which contains base operator names and the operators that they depend on. Currently, this operator dependency graph is converted into a Python dictionary to be imported in BUCK and used there. However, it is mostly fed into other executables by serializing it to JSON, and the consumer pieces this JSON back together by concatenating the arguments. This seems unnecessary. Instead, this diff retains the original YAML file and makes all consumers consume that same YAML file.
ghstack-source-id: 114641582
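
To make the role of the operator dependency graph concrete, the sketch below shows how a consumer might expand a list of root operators into everything they transitively depend on. It assumes the graph has already been loaded into a dictionary mapping each operator to its direct dependencies; the function name `transitive_closure` and the example edges are hypothetical and do not reflect the real contents of the generated file.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def transitive_closure(dep_graph: Dict[str, List[str]], root_ops: Iterable[str]) -> Set[str]:
    """Return the root operators plus every operator reachable from them."""
    selected: Set[str] = set()
    queue = deque(root_ops)
    while queue:
        op = queue.popleft()
        if op in selected:
            continue  # already expanded; also guards against cycles
        selected.add(op)
        queue.extend(dep_graph.get(op, []))
    return selected


# Hypothetical dependency edges, for illustration only.
dep_graph = {
    "aten::add": ["aten::empty", "aten::add_"],
    "aten::add_": ["aten::empty"],
    "aten::mul": ["aten::empty"],
}
closure = transitive_closure(dep_graph, ["aten::add"])
print(sorted(closure))  # ['aten::add', 'aten::add_', 'aten::empty']
```

Operators that no root reaches, such as `aten::mul` here, stay out of the selected set, which is what lets a selective build drop them.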

Test Plan: Build Lite Predictor + sandcastle.

Reviewed By: iseeyuan

Differential Revision: D24186303

fbshipit-source-id: eecf41bf673d90b960c3efe7a1271249f0a4867f
2020-10-20 02:00:36 -07:00
Dhruv Matani
0c5cd8c2b9 [RFC] Switch PyTorch Selective Build (Custom Build) to use the SelectiveBuilder abstraction (#45722)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45722

This diff does a bunch of things:

1. Introduces some abstractions as detailed in https://fb.quip.com/2oEzAR5MKqbD to help with selective build related codegen in multiple files.
2. Adds helper methods to combine operators, debug info, operator lists, etc...
3. Currently, the selective build machinery queries `op_registration_whitelist` directly at various places in the code. `op_registration_whitelist` is a list of allowed operator names (without overload name). We want to move to a world where the overload names are also included so that we can be more selective about which operators we include. To that effect, it makes sense to hide the checking logic in a separate abstraction and have the build use that abstraction instead of putting all this selective build specific logic in the code-generator itself. This change is attempting to do just that.
4. Updates generate_code, unboxing-wrapper codegen, and autograd codegen to accept the operator selector paradigm as opposed to a selected operator list.
5. Updates `tools/code_analyzer/gen_op_registration_allowlist.py` to expose an actual structured operator dependency graph in addition to a serialized string.
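
The motivation in item 3 can be sketched as follows. This is a hypothetical selector written for illustration, not the actual `SelectiveBuilder` API: the class name, field names, and method name are all assumptions. It only shows why moving the check behind one abstraction makes it possible to select by overload name without touching every call site.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class OperatorSelector:
    """Hypothetical selector; not the real SelectiveBuilder interface."""

    # Names without an overload, e.g. "aten::mul": every overload is selected.
    base_names: Set[str] = field(default_factory=set)
    # Names with an overload, e.g. "aten::add.Tensor": only that overload is selected.
    full_names: Set[str] = field(default_factory=set)

    def is_selected(self, name: str) -> bool:
        """Check a possibly overload-qualified operator name against the selection."""
        base = name.split(".", 1)[0]
        return name in self.full_names or base in self.base_names


selector = OperatorSelector(
    base_names={"aten::mul"},
    full_names={"aten::add.Tensor"},
)
print(selector.is_selected("aten::add.Tensor"))  # True
print(selector.is_selected("aten::add.Scalar"))  # False
print(selector.is_selected("aten::mul.Tensor"))  # True
```

Code generators then ask the selector a yes/no question per operator, so a later change in how the selection is stored stays local to the selector.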

There are a bunch of structural changes as well:

1. `root_op_list.yaml` and `combined_op_list.yaml` are now actual YAML files (not a space-separated list of operator names)
2. `generate_code.py` accepts only paths to operator list YAML files (both old style and new style), not a list of operator names as command-line arguments
3. `gen.py` optionally also accepts a path to a custom-build operators YAML file (this file has information about which operators to register in the generated library).

ghstack-source-id: 114578753

(Note: this ignores all push blocking failures!)

Test Plan:
`buck test caffe2/test:selective_build`

Generated YAML files after the change:

{P143981979}

{P143982025}

{P143982056}

Ensure that the generated files are same before and after the change:

```
[dhruvbird@devvm2490 /tmp/TypeDefault.cpp] find -name "*.cpp" | xargs md5sum
d72c3d125baa7b77e4c5581bbc7110d2  ./after_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./after_change/lite_predictor_lib_aten/TypeDefault.cpp
d72c3d125baa7b77e4c5581bbc7110d2  ./before_change/gen_aten/TypeDefault.cpp
42353036c83ebc7620a7159235b9647f  ./before_change/lite_predictor_lib_aten/TypeDefault.cpp
```

`VariableTypes_N.cpp` are generated the same both before and after the change:

```
[dhruvbird@devvm2490 /tmp/VariableType] find -name "*.cpp" | xargs -n 1 md5sum | sort
3be89f63fd098291f01935077a60b677  ./after/VariableType_2.cpp
3be89f63fd098291f01935077a60b677  ./before/VariableType_2.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./after/VariableType_4.cpp
40a3e59d64e9dbe86024cf314f127fd6  ./before/VariableType_4.cpp
a4911699ceda3c3a430f08c64e8243fd  ./after/VariableType_1.cpp
a4911699ceda3c3a430f08c64e8243fd  ./before/VariableType_1.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./after/VariableType_0.cpp
ca9aa611fcb2a573a8cba4e269468c99  ./before/VariableType_0.cpp
e18f639ed23d802dc4a31cdba40df570  ./after/VariableType_3.cpp
e18f639ed23d802dc4a31cdba40df570  ./before/VariableType_3.cpp
```
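
The same before/after comparison can be scripted. The sketch below assumes two directory trees of generated sources and reports the files whose MD5 digests differ; the helper names (`checksums`, `changed_files`) and the temporary files it creates are hypothetical and only exist to make the example self-contained.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def checksums(root: Path, pattern: str = "*.cpp") -> Dict[str, str]:
    """Map each matching file (path relative to root) to its MD5 hex digest."""
    return {
        str(path.relative_to(root)): hashlib.md5(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob(pattern))
    }


def changed_files(before: Path, after: Path) -> List[str]:
    """List files that were added, removed, or whose contents differ."""
    old, new = checksums(before), checksums(after)
    return sorted(name for name in old.keys() | new.keys() if old.get(name) != new.get(name))


with tempfile.TemporaryDirectory() as tmp:
    before, after = Path(tmp, "before"), Path(tmp, "after")
    for directory in (before, after):
        directory.mkdir()
        (directory / "VariableType_0.cpp").write_text("// generated\n")

    unchanged = changed_files(before, after)
    # Simulate a codegen change in one file.
    (after / "VariableType_0.cpp").write_text("// generated differently\n")
    changed = changed_files(before, after)

print(unchanged)  # []
print(changed)    # ['VariableType_0.cpp']
```

An empty result corresponds to the matching checksum pairs listed above.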

Reviewed By: ljk53

Differential Revision: D23837010

fbshipit-source-id: ad06b1756af5be25baa39fd801dfdf09bc565442
2020-10-18 15:10:42 -07:00
vishalrao487
d2623da52c replaced whitelist with allowlist (#45260)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/41754

**(1)**
Initially the file was named **gen_op_registration_whitelist.py**; I changed it to **gen_op_registration_allowlist.py**

**(2)**
There were some occurrences of **whitelist** in comments inside the file; I changed them to **allowlist**
![update1](https://user-images.githubusercontent.com/62737243/94106752-b296e780-fe59-11ea-8541-632a1dbf90d6.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/45260

Reviewed By: dhruvbird

Differential Revision: D23947182

Pulled By: ljk53

fbshipit-source-id: 31b486592451dbb0605d7950e07747cbb72ab80f
2020-09-29 00:27:46 -07:00
Jiakai Liu
4a9c80e82e [pytorch][bot] update mobile op deps (#44854)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44854

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23751925

Pulled By: ljk53

fbshipit-source-id: 8e1905091bf3abaac20d97182eb88f96e905ffc2
2020-09-17 18:33:13 -07:00
Jiakai Liu
3fa7f515a5 [pytorch][bot] update mobile op deps (#44700)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44700

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23719486

Pulled By: ljk53

fbshipit-source-id: 39219ceeee51861f90b228fdfe2ab59ac8a9704d
2020-09-16 17:20:15 -07:00
Ann Shan
0e3cf6b8d2 [pytorch] remove code analyzer build folder between builds (#44148)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44148

Automatically remove the build_code_analyzer folder each time build.sh is run
ghstack-source-id: 111458413

Test Plan:
Run build.sh with different options and compare the outputs (should be different).
Ex:
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=OFF' tools/code_analyzer/build.sh`

should produce a shorter file than
`ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseops MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: iseeyuan

Differential Revision: D23503886

fbshipit-source-id: 9b95d4365540da0bd2d27760e1315caed5f44eec
2020-09-04 10:38:12 -07:00
Jiakai Liu
b10c527a1f [pytorch][bot] update mobile op deps (#44100)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44100

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23496532

Pulled By: ljk53

fbshipit-source-id: 1e5b9059482e423960349d1361a7a98718c2d9ed
2020-09-03 11:24:26 -07:00
Jiakai Liu
402e9953df [pytorch][bot] update mobile op deps (#44018)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/44018

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D23470528

Pulled By: ljk53

fbshipit-source-id: b677e1c5677fc8929713ee108df69098502c50ea
2020-09-02 14:34:33 -07:00
Jiakai Liu
76ca365661 [pytorch][bot] update mobile op deps (#43937)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43937

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23443927

Pulled By: ljk53

fbshipit-source-id: 526ca08dfb5bd32527bff98b243da90dbbf2ea49
2020-09-01 10:07:52 -07:00
Jiakai Liu
ffca81e38b [pytorch][bot] update mobile op deps (#43871)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/43871

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23422523

Pulled By: ljk53

fbshipit-source-id: 95f2a1b6a2d25b13618c65944a2b919922083fb8
2020-08-31 14:42:12 -07:00
Jiakai Liu
3a0e35c9f2 [pytorch] deprecate static dispatch (#43564)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43564

Static dispatch was originally introduced for mobile selective build.

Since we have added selective build support for dynamic dispatch and
tested it in FB production for months, we can deprecate static dispatch
to reduce the complexity of the codebase.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23324452

Pulled By: ljk53

fbshipit-source-id: d2970257616a8c6337f90249076fca1ae93090c7
2020-08-27 14:52:48 -07:00
Jiakai Liu
3afd24d62c [pytorch] check in default generated op dependency graph (#43570)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43570

Add the default op dependency graph to the source tree - use it if the user runs
a custom build in dynamic dispatch mode without providing the graph.

Test Plan: Imported from OSS

Reviewed By: ezyang

Differential Revision: D23326988

Pulled By: ljk53

fbshipit-source-id: 5fefe90ca08bb0ca20284e87b70fe1dba8c66084
2020-08-27 14:51:44 -07:00
Ann Shan
87905b5856 [pytorch] add option to include autograd for code analyzer (#43155)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43155

Update the code_analyzer build.sh script to be able to take additional build flags in the mobile build/analysis

Test Plan:
Checkout associated PR or copy contents of build.sh into PyTorch repo (must be run from root of PyTorch repo)

To run with inclusion of autograd dependencies (note BUILD_MOBILE_AUTOGRAD is still an experimental build flag): `ANALYZE_TORCH=1 DEPLOY=1 BASE_OPS_FILE=/path/to/baseopsfile MOBILE_BUILD_FLAGS='-DBUILD_MOBILE_AUTOGRAD=ON' tools/code_analyzer/build.sh`

Reviewed By: ljk53

Differential Revision: D23065754

fbshipit-source-id: d83a7ad62ad366a84725430ed020adf4d56687bd
2020-08-24 15:04:43 -07:00
Edward Yang
7c50c2f79e Reimplement per-operator selective build (#39401)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/39401

This uses the technique proposed by smessmer in D16451848 to selectively
register operators without codegen.  See the Note inside for more
details.

This PR has feature parity with the old selective build apparatus:
it can whitelist schema def()s, impl()s, and on a per dispatch key
basis.  It has expanded dispatch key whitelisting, whereas previously
manually written registrations were not whitelisted at all.  (This
means we may be dropping dispatch keys where we weren't previously!)

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Reviewed By: pbelevich

Differential Revision: D21905593

Pulled By: ezyang

fbshipit-source-id: d4870f800c66be5ce57ec173c9b6e14a52c4a48b
2020-08-20 19:10:02 -07:00
Jiakai Liu
8ddd2c4e1b [pytorch] fix code analyzer for LLVM 9 & 10 (#42135)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42135

Tested the code analyzer with LLVM 9 & 10 and fixed a couple issues:
- Rename local demangle() which is available as public API since LLVM 9;
- Fix falsely associated op registrations due to the `phi` instruction;

Test Plan: Imported from OSS

Reviewed By: iseeyuan

Differential Revision: D22795508

Pulled By: ljk53

fbshipit-source-id: 2d47af088acd3312a7ea5fd9361cdccd48940fe6
2020-07-28 14:57:07 -07:00
Jiakai Liu
f92089b8ca [pytorch] tweak code analyzer script to handle new namespaces (#40276)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40276

- add a couple new namespaces;
- handle the case where both contextual namespace and operator namespace
  are set (BackendSelectRegister.cpp and #39401);
- improve error message;

Test Plan: Imported from OSS

Differential Revision: D22135686

Pulled By: ljk53

fbshipit-source-id: 14d359c93573349b8fe1e05d7e44d875295a5f6d
2020-06-19 14:54:21 -07:00
Sebastian Messmer
5af4e76683 Back out "Revert D21530545: Remove call_unboxed_super_slow_temp_shim" (#38742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38742

Original commit changeset: af9013ed37d2
ghstack-source-id: 104397898

Test Plan: waitforsandcastle

Differential Revision: D21651660

fbshipit-source-id: 8bb56eb8abd43fd01d1468f104babe92a09d2ad4
2020-05-19 18:23:20 -07:00
Sebastian Messmer
363a2d9455 Revert D21530545: Remove call_unboxed_super_slow_temp_shim
Test Plan: revert-hammer

Differential Revision:
D21530545

Original commit changeset: cdfb801e5519

fbshipit-source-id: af9013ed37d27bf8dca859902918c02eb8cceeb4
2020-05-19 16:07:36 -07:00
Sebastian Messmer
423a00ad39 Remove call_unboxed_super_slow_temp_shim (#38351)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/38351

ghstack-source-id: 104368838

Test Plan: waitforsandcastle

Differential Revision: D21530545

fbshipit-source-id: cdfb801e551993ecb339f3f8ec7c9b3039766989
2020-05-19 14:19:28 -07:00
Sebastian Messmer
6968c8153e Warn against callOp (#37797)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37797

This is slow (see comment in code).
Not fixing this yet, but at least adding a warning so people are aware and don't add new call sites.
ghstack-source-id: 103887226

Test Plan: waitforsandcastle

Differential Revision: D21390364

fbshipit-source-id: 7bff1c3b9756a16c9d9110f209c23bf557266dda
2020-05-11 19:21:50 -07:00
Jiakai Liu
9d0891f886 [pytorch][buck] tweak code analyzer e2e script
Summary:
- Add debug mode to include debug information.
- Move codegen comment to FB shell script (as it's only checked in to the FB repo).
- Analyze lite-predictor instead of full-JIT, as the full-JIT BUCK target contains variable kernels and thus pulls in a lot more dependencies.
- Use pre-opt bitcode instead of pre-codegen bitcode - there is one special `callOp()` case in RNN.cpp where optimized bitcode has opname string and API body inlined together: https://fburl.com/diffusion/8rz6u4rg; pre-optimization bitcode should give a more stable result.

Test Plan: - Tested the bash script with stacked diff.

Reviewed By: iseeyuan

Differential Revision: D21298837

fbshipit-source-id: be33e2db5d8cb0f804460c503e52beb0dcb4857f
2020-04-29 22:38:09 -07:00
Jiakai Liu
8258d42bd0 [pytorch] add '__BASE__' section to op deps to factor out frequently used util ops (#37404)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37404

Many aten operators are really like util functions, e.g.:
aten::is_nonzero, aten::is_floating_point, etc. These ops can be called
via overloaded c++ operator, so seemingly trivial and innocent code changes can
affect how these ops are used by other ops (and thus change the output of
the static analyzer).

Most of these util ops are rather small in terms of build size cost, so
for the purpose of optimizing binary size with custom build, whether to
include these ops or not does not make a significant difference. In fact,
for non-trivial models a set of these ops is almost always used.

This PR introduced the (optional) '__BASE__' ops section to the dependency graph.

We can maintain the list of frequently used small util ops for internal BUCK
build. This way, the output dependency graph will only contain meaningful
edges with significant binary size impact, and it will be more stable from
trivial code changes (which is checked in FB codebase).

Having a stable and sparse deps graph by factoring out frequently used base ops
is also a nice property that allows us to explore alternative custom build
solutions in case we find it hard to maintain the static code analyzer.
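
As a rough illustration of the '__BASE__' idea described above, the following
Python sketch factors a list of base ops out of a dependency graph. The
dict-of-lists graph layout, the `factor_out_base_ops` helper and the sample
edges are assumptions made for illustration; they are not the analyzer's
actual output format or code.

```python
from typing import Dict, Iterable, List

BASE_KEY = "__BASE__"


def factor_out_base_ops(
    graph: Dict[str, List[str]], base_ops: Iterable[str]
) -> Dict[str, List[str]]:
    """Return a sparser graph with base ops collected under BASE_KEY."""
    base = set(base_ops)
    # The base section lists the base ops plus whatever they depend on,
    # so nothing reachable through a base op is lost.
    base_deps = set(base)
    for op in base:
        base_deps.update(graph.get(op, []))
    result: Dict[str, List[str]] = {BASE_KEY: sorted(base_deps)}
    for op, deps in graph.items():
        if op in base:
            # Base ops no longer get their own entry.
            continue
        # Edges pointing at base ops are dropped from every other entry.
        result[op] = sorted(d for d in deps if d not in base)
    return result


# Hypothetical input graph: op name -> ops it calls.
graph = {
    "aten::add": ["aten::is_nonzero", "aten::empty"],
    "aten::conv2d": ["aten::is_floating_point", "aten::add"],
    "aten::is_nonzero": [],
    "aten::is_floating_point": [],
}
sparse = factor_out_base_ops(
    graph, ["aten::is_nonzero", "aten::is_floating_point"]
)
print(sparse)
```

In this sketch a trivial code change that adds or removes a call to a base op
leaves every non-base entry unchanged, which is the stability property the
summary is after.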

Test Plan: Imported from OSS

Differential Revision: D21280835

Pulled By: ljk53

fbshipit-source-id: c4d0d1f07ca868c60f23118d877fc1eeead4c875
2020-04-28 17:18:09 -07:00
Jiakai Liu
e0a5b443d6 [pytorch] remove unused flags from code analyzer & move format support to python (#37393)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/37393

Simplify the code analyzer by removing some unused flags and moving the
different format printer logic to python script. It's easier to add other
post processing logic to adapt to different BUCK build configs.

Test Plan: Imported from OSS

Differential Revision: D21280836

Pulled By: ljk53

fbshipit-source-id: 0d66d5891d850f012c4ab4f39eabbd9aecc1caa9
2020-04-28 17:16:55 -07:00
Edward Yang
a894fff265 Back out "Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API"
Summary: Original commit changeset: 636e8a11afc6

Test Plan: export to OSS

Reviewed By: malfet

Differential Revision: D21170502

fbshipit-source-id: e8f35f103c4924aedbcaaf868475008d24bdeeab
2020-04-22 09:18:23 -07:00
James Reed
2ccdc39dce Revert D21089648: Put TORCH_LIBRARY in torch/library.h; add custom class API
Test Plan: revert-hammer

Differential Revision:
D21089648

Original commit changeset: 8d54329c1252

fbshipit-source-id: 636e8a11afc628a4cdae9d44824985c10c70555e
2020-04-21 12:21:45 -07:00
Edward Yang
01100cb477 Put TORCH_LIBRARY in torch/library.h; add custom class API (#36742)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36742

Now, you can define a custom class inside a TORCH_LIBRARY block.
It looks very similar to what you did before.  Instead of

```
static auto m = torch::class_<Class>("Namespace", "Class").def("foo", foo);
```

you write

```
TORCH_LIBRARY(Namespace, m) {
  m.class_<Class>("Class")
    .def("foo", foo);
}
```

All the old usages still work, but at some point we should start
updating the tutorials when we're ready to go 100% live with the
new pybind11 style API.

custom class API previously lived in torch/ folder and in torch
namespace, so for consistency, the new TORCH_LIBRARY also got
moved to torch/library.h.  The definition of Library::class_ is in the
bottom of that header because I need all of the class_ constructors
available, but there is a circular dependency between the two headers.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Differential Revision: D21089648

Test Plan: Imported from OSS

Pulled By: ezyang

fbshipit-source-id: 8d54329c125242605336c22fa1642aae6940b507
2020-04-21 10:05:21 -07:00
Edward Yang
e29348f828 Switch to pybind11 style registration function API. (#36258)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36258

Previously we had a && chaining style API.  There are some downsides to
this API:

- It's easy to forget the 'static' qualifier in front, leading to
  subtle ODR bugs.
- It is not compatible with torchbind class_ definitions, as these
  need multiple levels of chaining.  So in practice people end
  up having to define multiple static initializers, one per class.
- It's not like pybind11.
- There's no way to conveniently get the file and line number of
  the registration, as there is no macro point in the API.
- The old API doesn't really encourage people to put all of their
  definitions for a library in one place, and to give a custom
  namespace for it.  Similarly, the old API wasn't very DRY, because
  you had to keep repeating the namespace/dispatch key you
  were writing implementations for.

The new API is modeled exactly off of the PYBIND11_MODULE macro:
you write:

```
TORCH_LIBRARY(aten, m) {
  m.def("aten::add(Tensor self, Tensor other) -> Tensor");
  ...
}
```

in a non-chaining fashion, and under the hood the macro expands to
define a function, and define a static initializer that allocates
c10::Library (previously called c10::Module, but we renamed it
to avoid confusion with the existing NN module concept), passes
it to your function, and then retains it for the rest of the lifetime
of the program.  Specification of the namespace is mandatory,
and in a later commit I plan to make it a hard error to TORCH_LIBRARY
the same library name twice.

If you are specifying an implementation for an existing operator
(e.g., you're the XLA backend, or even if you're just putting
registrations for implementations at the implementation site),
you should use TORCH_LIBRARY_IMPL, which instead takes a backend
argument (instead of namespace) and can be used to specify an
implementation for a backend.  Unlike TORCH_LIBRARY, you can do
as many of these as you want for a backend.

This needs updates to the mobile code analyzer.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929257

Pulled By: ezyang

fbshipit-source-id: ba04d78492e8c93ae7190165fb936f6872896ada
2020-04-16 10:44:21 -07:00
Jiakai Liu
f98e0a099a [pytorch] handle pybind11 style registration API with code analyzer (#36607)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36607

PR #36258 and subsequent PRs in the stack switch c10 registrations to
the new pybind11 style registration API. One notable difference from the old
c10 registration API is that the operator's namespace is no longer in the op
schema string, e.g. "aten::" will be factored out from "aten::conv",
"aten::empty", etc. The namespace string will be declared at the
beginning of registrations with the TORCH_LIBRARY / TORCH_LIBRARY_IMPL
macro.

A rather simple fix is to extract namespace string from the name of
enclosing function of registrations, as the TORCH_LIBRARY macro will
always create an init function (per namespace) by appending namespace
string to a common prefix.

Another side effect of the API change is that it adds some debug string
constants to the registration API, and because of factoring out the
namespace part from op name, there is no longer an effective way to
differentiate between real op name and debug strings. A simple
workaround is that we only keep the first string constant the analyzer
encounters while BFSing the LLVM IR - the real op name is directly passed into the
registration call while the debug string is indirectly passed via
CppFunction.

These new assumptions might be broken by future changes, but the fix is
simple to implement and unblocks the API work.
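
The two heuristics above can be sketched in Python as follows. This is only
an illustration under stated assumptions: the `TORCH_LIBRARY_init_` prefix,
the node names and the dict-based stand-in for LLVM IR are hypothetical, and
the real analyzer works on LLVM IR rather than on Python dicts.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical common prefix of the per-namespace init function.
INIT_PREFIX = "TORCH_LIBRARY_init_"


def namespace_from_function(name: str) -> Optional[str]:
    """Recover the namespace from the enclosing init function's name."""
    if name.startswith(INIT_PREFIX):
        return name[len(INIT_PREFIX):]
    return None


def first_string_constant(
    root: str,
    operands: Dict[str, List[str]],
    constants: Dict[str, str],
) -> Optional[str]:
    """BFS from `root` and return the first string constant reached."""
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in constants:
            return constants[node]
        for child in operands.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return None


# Toy IR: the op name is a direct operand of the registration call, while
# the debug string is only reachable through the CppFunction operand.
operands = {
    "call_impl": ["str_op", "cpp_function"],
    "cpp_function": ["str_debug"],
}
constants = {"str_op": "conv", "str_debug": "registered at some_file.cpp"}

ns = namespace_from_function("TORCH_LIBRARY_init_aten")
op = first_string_constant("call_impl", operands, constants)
print(f"{ns}::{op}")  # prints "aten::conv"
```

Because the breadth-first search visits direct operands before indirect ones,
the directly passed op name wins over the debug string, which is the
assumption the summary warns may break in the future.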

Test Plan: Imported from OSS

Differential Revision: D21026008

Pulled By: ljk53

fbshipit-source-id: c8c171d23aaba6d6b7985d342e8797525126a713
2020-04-15 11:03:41 -07:00
Jiakai Liu
9cac2b83d9 [pytorch] improve code analyzer to dump ops called from c++ functions (#35941)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/35941

The key step of mobile custom build is to find out ops used by specific
model, with which it can produce a tailored build of optimal size.

However, ops can be called not only from a TorchScript model but also
directly from C++ code, e.g. via torch::jit:: APIs. With
static dispatch, ops called this way will be statically linked into client
code. With dynamic dispatch, we need to obtain & keep these ops explicitly.

This PR improves the static code analyzer to dump ops that are called from
visible c++ symbols matching a specific regex. This provides a mechanism
to solve the custom build problem with dynamic dispatch.

It starts with dumping ops that are callable from functions in the torch::jit
namespace and including them in custom build with dynamic dispatch. We can
extend it to analyze custom code, to refine the set of JIT APIs that
are relevant, etc. This is just a preliminary version. We need to
improve its usability for more general purposes.
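
A minimal Python sketch of the approach described above: pick root functions
whose names match a regex, walk the call graph from them, and collect every
op called along the way. The graph layout, the `ops_reachable_from` helper
and all function names in the sample data are hypothetical; the real analyzer
derives this information from LLVM IR.

```python
import re
from collections import deque
from typing import Dict, List, Set


def ops_reachable_from(
    call_graph: Dict[str, List[str]],
    ops_called: Dict[str, List[str]],
    root_pattern: str,
) -> Set[str]:
    """Collect ops reachable from functions whose name matches the regex."""
    pattern = re.compile(root_pattern)
    functions = set(call_graph) | set(ops_called)
    roots = sorted(fn for fn in functions if pattern.search(fn))
    seen = set(roots)
    queue = deque(roots)
    ops: Set[str] = set()
    while queue:
        fn = queue.popleft()
        # Ops called directly by this function.
        ops.update(ops_called.get(fn, []))
        # Follow plain C++ calls to find ops called indirectly.
        for callee in call_graph.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return ops


# Hypothetical call graph: function -> functions it calls.
call_graph = {
    "torch::jit::load": ["helper_a"],
    "helper_a": ["helper_b"],
    "other::fn": ["helper_c"],
}
# Hypothetical map: function -> ops it invokes through the dispatcher.
ops_called = {
    "torch::jit::load": ["aten::to"],
    "helper_b": ["aten::empty"],
    "helper_c": ["aten::conv2d"],
}

ops = ops_reachable_from(call_graph, ops_called, r"^torch::jit::")
print(sorted(ops))  # prints ['aten::empty', 'aten::to']
```

Changing the regex is enough to analyze a different set of entry points, for
example custom client code instead of the torch::jit namespace.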

Test Plan: Imported from OSS

Differential Revision: D20835166

Pulled By: ljk53

fbshipit-source-id: a87cfb22b34f89545edd0674a5dfca6b7cff2b0c
2020-04-14 23:21:19 -07:00
Edward Yang
2de3e491a8 [RELAND] Add temporary impl_UNBOXED syntax sugar for unboxed-only defs. (#36223)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/36223

Previously #35714

There are a lot of unboxed only defs.  We're committed to removing
them at the end of the half but as I am about to do a lot of porting
to the new API, let's get them into a form where they're easy to
remove.  This is a new overload impl_UNBOXED that will pass
the function pointer straight to CppFunction::makeUnboxedOnly

I don't attempt to make the _UNBOXED API complete; in particular,
catchall declarations don't get this sugar (as there are very few
of them).

To get some coverage of _UNBOXED API for code analysis, I switched
one of our unboxed tests to be an impl rather than a def.  This
shouldn't materially affect coverage.

Signed-off-by: Edward Z. Yang <ezyang@fb.com>

Test Plan: Imported from OSS

Differential Revision: D20929259

Pulled By: ezyang

fbshipit-source-id: 72d2061b6c8a6afbcd392b47f53ade18de2f9184
2020-04-09 14:58:33 -07:00