Commit graph

224 commits

Author SHA1 Message Date
Akshay Parashar
38169c2287 [Static Runtime] Fix precision error in test cases (#80935)
Summary:
- Test cases related to DeepAndWideSciptModel() was crashing at random due to precision issue
- test cases related for precision: DeepWide, KWargsAPI_1, KWargsAPI_2, KWargsAPI_Optional, FusionPass
- test failure was not observed always due to random input to the model (via torch::randn)
- Increasing the absolute tolerance for test cases

Differential Revision: D37639067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/80935
Approved by: https://github.com/mikeiovine
2022-07-06 16:31:18 +00:00
Hui Guo
a622e3e14d [static runtime] Fix linalg_solve test case (#79971)
Summary: Added the third bool argument in inalg_solve test case to remove the runtime error.

Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine

Differential Revision: D37324419

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79971
Approved by: https://github.com/tenpercent
2022-06-22 16:31:25 +00:00
Max Podkorytov
bf75708ce4 [static-runtime] add nnc codegen for aten::div (#76903)
Differential Revision: D36151087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/76903
Approved by: https://github.com/mikeiovine
2022-06-22 05:47:44 +00:00
Hui Guo
0545c85f74 [static runtime] Add JIT prim ops: aten::cpu, aten::list, aten::numel, aten::__range_length (#79111)
Summary: This adds the missing jit prim ops appear in the non ads models for c2->pt mitigation: aten::cpu, aten::list, aten::numel, aten::__range_length

Test Plan: static runtime unit tests

Differential Revision: D36984960

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79111
Approved by: https://github.com/davidberard98
2022-06-18 16:38:58 +00:00
Hui Guo
aee9762a51 [static runtime] Disable unit test for linalg_svdvals (#79574)
Summary: The test is throwing a jit alias analysis not supporting error. Disabling it for now.

Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine

Differential Revision: D37056032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79574
Approved by: https://github.com/mikeiovine
2022-06-15 20:35:19 +00:00
Hui Guo
8d7fcfa8f1 [static runtime] Add native ops: aten::index_put, aten::item, aten::tensor_split (#79065)
Summary: This adds the pytorch operators that are currently missing in non-ads models from c2->pt mitigation: aten::index_put, aten::item, aten::tensor_split

Test Plan: buck run mode/opt caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36984961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/79065
Approved by: https://github.com/davidberard98
2022-06-15 19:15:34 +00:00
Akshay Parashar
28f87b9cf9 [Static Runtime] Fix aten::clone out variant (#78297) (#78322)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78297

Clone followed by expand/expand_as due to memoryOverlap check on copy_ native method. Refer to T118519310 for more details.

Crashing test case:
a = tensor(3,1)			  // strides = (1,1)
B = tensor(3,2)	          // strides = (2,1)
Temp = a.expand_as(b).   // creates temp with shape as (3,2) and strides as (1,0)
temp.clone()		         // crashe on copy_ due to memoryOverlap

Fix: Disable the out variant for the expanded tensor.
- Calls native clone instead of out variant for clone dealing with expanded tensors
- Added test case for both clone variants (out and native clones)
- Increased the tensor size for memory planner test case to trigger dynamic allocation

Test Plan:
buck test caffe2/benchmarks/static_runtime/fb:test_fb_operators

buck test caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36672180

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78322
Approved by: https://github.com/mikeiovine
2022-06-02 21:06:59 +00:00
Max Podkorytov
ebfc70f37a [static-runtime] out variant for aten::mean (#78161)
Summary: As subject

Test Plan: Added unit tests

Differential Revision: D36614633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78161
Approved by: https://github.com/mikeiovine
2022-06-02 20:56:42 +00:00
Max Podkorytov
2679755bdc [static-runtime] out variant for aten::max (#78271)
Summary: Previously the op was auto-generated but it only covered the pointwise overload of aten::max. This adds support for reduction, overall and along a dim

Test Plan: Added a unit test

Differential Revision: D36656378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78271
Approved by: https://github.com/mikeiovine
2022-05-26 23:29:27 +00:00
Hui Guo
d12bf9fd75 [static_runtime] Add auto-generated view ops (#77106)
Summary: This includes the generated view ops from D36258767.

Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Differential Revision: D36258968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77106
Approved by: https://github.com/alanwaketan, https://github.com/tenpercent
2022-05-26 03:13:59 +00:00
mikeiovine
56c23f5633 [SR] Out variant for embedding_bag_byte_unpack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77661

Add an out variant and wrapper in static runtime.

I just added the declaration with the others in `qembeddingbag.h` for now (rather than properly adding the out variant to the torch library). This can be fixed in a followup.

Differential Revision: [D36449840](https://our.internmc.facebook.com/intern/diff/D36449840/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36449840/)!

Approved by: https://github.com/tenpercent
2022-05-25 23:24:11 +00:00
mikeiovine
2ae3c59e4b [SR] Remove linear/relu fusion
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77620

Apparently, this is not implemented in fbgemm, so it's strictly worse than using NNC.

Differential Revision: [D36431811](https://our.internmc.facebook.com/intern/diff/D36431811/)

Approved by: https://github.com/hlu1
2022-05-23 21:46:27 +00:00
Hao Lu
c60d2ef4eb [StaticRuntime] Replace Permute with copy version only when it's followed by reshape or flatten (#77832)
Reviewed By: mikeiovine

Differential Revision: D36466622

Pull Request resolved: https://github.com/pytorch/pytorch/pull/77832
Approved by: https://github.com/mikeiovine
2022-05-20 03:14:01 +00:00
mikeiovine
02713221e3 [SR] Fuse clamp/nan_to_num
Pull Request resolved: https://github.com/pytorch/pytorch/pull/77094

Fuse `clamp` and `nan_to_num` in an NNC kernel. This leads to a big speed up on many models. We can avoid comparisons since clamp potentially gets rid of all of the `inf`s in the input tensor.

Differential Revision: [D36220967](https://our.internmc.facebook.com/intern/diff/D36220967/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D36220967/)!

Approved by: https://github.com/navahgar
2022-05-10 23:33:59 +00:00
Mike Iovine
849984a2cd [SR] Sigmoid out variant calls fast_sigmoid (#75661)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75661

`fast_sigmoid` is a variant of sigmoid in NNC that is implemented in terms of `fast_tanh` (which is a fast rational function approximation).
ghstack-source-id: 155604086

Reviewed By: navahgar, hlu1

Differential Revision: D35481390

fbshipit-source-id: 1d64b5c375539f3b2461a1f3d9b86cd696eae7a1
(cherry picked from commit 8106c2512b8d7b373cb6545a43c3e8fc04805c4b)
2022-05-06 00:14:30 +00:00
Mike Iovine
1fed6b7559 [SR] Eliminate extra permutes around softmax calls (#76391)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76391

I've seen this pattern in many important internal models:

```
x = torch.permute(a, [0, 2, 1])
y = torch.softmax(x, 2)
z = torch.permute(y, [0, 2, 1])
```

This is equivalent to
```
z = torch.softmax(x, 1)
```

The `permute` ops can degrade performance, especially if copy variants are on. Add another pattern to our `EliminateExtraPermuteOpsPass` to handle this.
ghstack-source-id: 155466506

Test Plan: New unit tests

Reviewed By: navahgar, huiguoo

Differential Revision: D35938289

fbshipit-source-id: 398b5528077b0b3f1c6fc5544e483803e96d68e9
(cherry picked from commit d742abd094d1fef23ca6a34703d97a6da2d14bd1)
2022-05-04 23:08:49 +00:00
Mike Iovine
cac2733af1 [SR] Codegen for aten::clamp (#76340)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76340

NNC kernel for `clamp` scalar case
ghstack-source-id: 155466507

Reviewed By: navahgar, huiguoo

Differential Revision: D35904019

fbshipit-source-id: e4115757f7e2cbdf364b88be3f599dfc3028750f
(cherry picked from commit bdc4b918bc5a14490f46c79793f764b28c18388f)
2022-05-04 23:08:49 +00:00
Hui Guo
bcddd4ab3e [Static Runtime] Add auto generated unstructured ops (#76398)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76398

This diff adds the large files that include the newly generated ops from D34913736. Refer to the base diff for more details.

Test Plan: buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest

Reviewed By: mikeiovine, tenpercent

Differential Revision: D35945633

fbshipit-source-id: 53497bd5c490a57ea1521837762f740deb42bfd8
(cherry picked from commit e0fbdcb0bf09f5c192430f95f450c0a946c80074)
2022-05-04 19:34:19 +00:00
Mike Iovine
fc64dbdc01 [SR] Fuse quantized linear/relu (#75775)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75775

fbgemm kernels already implement the fused kernel, no reason not to use it
ghstack-source-id: 155450342

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35633297

fbshipit-source-id: a744a33a65ce7dbb9ce8900dbe091b6d56dd4e48
(cherry picked from commit b1361b349862715aa17e6318c5e658cd6401a464)
2022-05-04 19:01:14 +00:00
Mike Iovine
b02b3f25db [SR] Quick hack to eliminate no-op slice (#75774)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75774

`list[0:]` is a no-op. This should really be eliminated on the modeling side, implement as a graph pass for now until we can get this into prod models.

Test Plan: New unit tests

Reviewed By: navahgar

Differential Revision: D35632947

fbshipit-source-id: 0c564193c35039130e99172e0185e124ea24f62d
(cherry picked from commit e01d5273185e39a563c7acb15662d9c1549d4b58)
2022-05-03 19:29:46 +00:00
Mike Iovine
3fa77fa51a [SR] Fix quantized linear tests not managing outputs (#75776)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75776

The output was returned directly instead of a clone, so the output of the relevant op would not be managed.
ghstack-source-id: 154935103

Test Plan: CI

Reviewed By: navahgar

Differential Revision: D35633469

fbshipit-source-id: 7b08b7368e0349a12abf8802a4c625ffecdc5abb
(cherry picked from commit 24bed9ba4da39cff7f3b40f5e49dfded2552b373)
2022-04-27 16:38:54 +00:00
Ansha Yu
ee636e2fd1 [sr] remove max_indices argument of embedding_bag when unncessary (#75993)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75993

Strobelight shows copy_ in embedding_bag taking up a lot of time in adfinder_story_post_ad_session_exit_model 334827604_0
{F723683014}

More details in https://fb.quip.com/MKumAjz1YD4 (1f47a80e88)a#temp:C:FPD3 (ecd5567980)e5a0871ae5d481286b511ef7

The last 3 outputs of embedding_bag are unused in the graph: P495814049.
* max_indices output isn't necessary for the main output, so remove it when it's not used in the graph.
* offset2bag is used as an intermediate to calculate the main output, so we don't remove this output even though it's unused in the graph.
* bag_size is used as an intermediate to calculate the main output for MODE_MEAN, so we don't remove this for now.

Test Plan:
`./caffe2/caffe2/fb/predictor/scripts/run_disagg_model_benchmarks.sh 334827604 0 /data/users/ansha/tmp/ads_tail sr_only`

Inputs uploaded to `/mnt/persistent-public/ansha/ads_tail/334827604`

Before:
I0414 10:53:12.261133 1070948 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.121318. Iters per second: 8242.78
        0.11156 ms.    99.0457%. aten::embedding_bag (52 nodes, out variant)

After:
I0418 13:05:10.837378 2354604 PyTorchPredictorBenchLib.cpp:305] PyTorch run finished. Milliseconds per iter: 0.0881273. Iters per second: 11347.2
      0.0789221 ms.    98.7096%. static_runtime::embedding_bag (52 nodes, out variant)

* Ads prod canary:
https://www.internalfb.com/intern/ads/canary/443002539593035806/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_inline_cvr_post_imp -a D35726594`
https://www.internalfb.com/intern/servicelab/602875732/
* 4M test: `servicelab create cogwheel_pyper_inference_fullsync_ads_10x_ctr_mbl_feed_non_mimo -a D35726594`
https://www.internalfb.com/intern/servicelab/1002874745/

Reviewed By: mikeiovine

Differential Revision: D35726594

fbshipit-source-id: 3b71a0822657bf7a23ce37ca899baef9997b011a
(cherry picked from commit fd5e3098c047a1e7d4348e1c97341eecb892536e)
2022-04-22 15:36:35 +00:00
Mike Iovine
b6a4234090 [SR] Fix broken unit test build (#76111)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/76111

https://github.com/pytorch/pytorch/pull/68640 broke our build by porting `cat` structured kernels, not sure how CI didn't catch this
ghstack-source-id: 154335722

Test Plan: CI

Reviewed By: navahgar, ajyu

Differential Revision: D35780296

fbshipit-source-id: 0a262eb06a8d619227e5db10b6a775bf0b2e17c1
(cherry picked from commit aea6fbf9365391011df5211164e3978075d7a5cb)
2022-04-20 18:36:31 +00:00
mikeiovine
98b4a4100d [SR] Add a copy variant for fused_split_and_squeeze
Pull Request resolved: https://github.com/pytorch/pytorch/pull/75660

The outputs of `split_and_squeeze` are passed to `VarStack` in models we care about. `VarStack` has a [fast path](https://www.internalfb.com/code/fbsource/[893193f5277184fd17f4ea3f28fe415a4df37707]/fbcode/caffe2/aten/src/ATen/native/TensorShape.cpp?lines=296-298) for when all of its inputs have the same strides.

Hitting the slow path adds a ton of extra overhead - so much that it's worth it to copy in `split_and_squeeze` and force all of `VarStack`'s inputs to be contiguous so we can take advantage of the fast path.

Differential Revision: [D35513777](https://our.internmc.facebook.com/intern/diff/D35513777/)

**NOTE FOR REVIEWERS**: This PR has internal Facebook specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35513777/)!

Approved by: https://github.com/hlu1
2022-04-13 20:02:01 +00:00
Mike Iovine
2f98fa9147 [SR] Do not manage tensors that escape scope via container (#74966)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74966

It's clear that we don't want to manage tensors that escape their scope. Previously, we handled this by checking whether the tensor aliased the graph outputs. But there's actually another way to escape scope: by aliasing the wildcard set. The following graph demonstrates this:

```
def forward(self, cond: bool, a, b):
    lst = []
    if cond:
        res = a + b # res should not be managed!!!
        lst.append(res)
    return lst
```

The `if cond:` sub-block returns nothing, but `res` escapes the scope through `lst`.

The fix is simple: we simply have to mark values that alias the wildcard set as an `external_alias_` in `ValueGroup`.

This diff also exposed another issue (via unit tests) in `checkOutputTensorMemoryLeaks`: it assumes that, if a node's `Value*` is managed, the underlying `IValue` must be a tensor. But this is not true after the addition of `to_maybe_copy_out`; TMCO does not produce a tensor in its first output slot if it does not copy.
ghstack-source-id: 153288188

Test Plan: New unit tests cover the problematic case

Reviewed By: navahgar

Differential Revision: D35257087

fbshipit-source-id: 853a761dffe51f2c70720759664dd8dfcd56d1d7
(cherry picked from commit 2c7f519354041975f33626eab6b7f16c2494bbf8)
2022-04-07 19:57:57 +00:00
Mike Iovine
4055d1f653 [SR] Fix StaticRuntime move ctor (#74927)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74927

The move ctor was broken because `BlockRunner` stores a reference to `values_`. When moving runtime instances, the pointer to the root block would be moved, but the reference inside it would not be updated.

Pass `BlockRunner` a raw pointer to the heap-allocated IValues instead to avoid this issue.
ghstack-source-id: 153168602

Test Plan: New unit test/CI

Reviewed By: navahgar

Differential Revision: D35228467

fbshipit-source-id: 04e198b39f898b82677a0e41e1cdf00c2b0c09f3
(cherry picked from commit 03e2c591ac3a907d68025eae9500ed7226dec17e)
2022-04-07 02:16:37 +00:00
Don Jang
85e163c56b [Static Runtime] Fix a bug that aten::full_like reuses a tensor that does not match arguments (#74255)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74255

This change fixes a bug that `aten::full_like` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full_like` are dynamically changed.

Test Plan: - Enhanced `StaticRuntime.FullLike` to cover the modified code path.

Reviewed By: mikeiovine

Differential Revision: D34863639

fbshipit-source-id: ca6d4ee3c039e263cc3a4f643d949cea59381608
(cherry picked from commit ae7db0af5e7d95d866027abc968afcb162fd2ef8)
2022-04-05 22:30:41 +00:00
Raghavan Raman
60bda4d06b [Static Runtime] Fix handling relu in quantized linear relu dynamic op
Summary:
The implementation of `PackedLinearWeightFp16::apply_dynamic_impl` [here](https://www.internalfb.com/code/fbsource/[b1ef7c31f022]/fbcode/caffe2/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp?lines=393) does not handle `relu`. It completely ignores the `ReluFused` boolean template parameter.

At this point, callers of that function handle `relu` explicitly. While the correct thing to do would be to handle the `ReluFused` parameter in that implementation, it is not clear if that semantics is being followed in this code. So, we are handling this in SR's out-variant implementation, until the owner fixes that issue.

This issue resulted in incorrect results when Static Runtime was enabled for the MRS video model.

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=StaticRuntime.QuantizedLinearReluDynamicFp16
```

Reviewed By: mikeiovine

Differential Revision: D35366309

fbshipit-source-id: e60126e3590d52681ceaee5583b81c4c0b5404d9
(cherry picked from commit cabeb96a792339e7dbfd16cb51a3ac9039812137)
2022-04-04 22:16:22 +00:00
Max Podkorytov
11c412a8ec [static-runtime] optimize empty if blocks at runtime (#74987)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74987

Add specializations to `prim::If` operator at runtime to save resources when some of subblocks are empty

Test Plan:
`buck build //caffe2:torch-cpp-cpu`
`buck test //caffe2/benchmarks/static_runtime/...`
Add unit test:
`buck test //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- StaticRuntime.EmptyIfBlock`

Reviewed By: mikeiovine

Differential Revision: D35262952

fbshipit-source-id: 324f88471f33f035f4d8a9b212716530d8e59df2
(cherry picked from commit 2db1b1a6833b1376fa376f54791effc8e12fb77f)
2022-04-01 05:43:33 +00:00
Mike Iovine
2ca66ffb7d [SR] Force split_and_squeeze usage via graph transformation (#74274)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/74274

Reviewed By: navahgar

Differential Revision: D34913889

fbshipit-source-id: 655d3f1e5f4c027cb94758b74826a4b4882e9458
(cherry picked from commit bc94d30b69888ca6633a27090a3b87a08919231a)
2022-03-29 19:13:40 +00:00
Mike Iovine
3f37337ed0 [SR] Native implementation for reshape_as (#74585)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74585

Native static runtime for `aten::reshape_as`
ghstack-source-id: 152340038

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35060895

fbshipit-source-id: c4e6f8a04c7df3821c7e654bfaf584e5a72ea701
(cherry picked from commit 6fa596cd866a024b6653239e0e30ddad42de242f)
2022-03-28 17:02:14 +00:00
Mike Iovine
9f2344aa40 [SR] Native implementation for select (#74568)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74568

Native static runtime implementation for `aten::select(Tensor, int, int)` overload
ghstack-source-id: 152340037

Test Plan: New unit test

Reviewed By: hlu1

Differential Revision: D35053900

fbshipit-source-id: c315d4202a4dfca3360325547af805aea33ecc9f
(cherry picked from commit 8683f214dbd8c081365bad727007bbff969b64d0)
2022-03-28 17:02:14 +00:00
Mike Iovine
facdbe6d72 [SR] Native implementation for IntImplicit (#74562)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74562

Add a native implementation for `aten::IntImplicit`, which is similar to `aten::Int` except for a few extra checks it must do
ghstack-source-id: 152340039

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D35052997

fbshipit-source-id: cb2f0faf7c62382e3f13750d8e1280c49c6b9e42
(cherry picked from commit 359c7493f8deaeccebc27e1b6e6e9777850010c1)
2022-03-28 17:02:14 +00:00
Mike Iovine
f5a9c36d0b [SR] Eliminate extra permute ops before aten::sum (#74481)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74481

This diff fixes an interesting performance issue related to `permute_copy`.

We see this pattern frequently:
```
y = torch.permute(x, (0, 2, 1))
z = torch.sum(y, dim=-1)
```

With copy variants off, we get a strided output from `permute`, and we hit this (faster) kernel in `sum`: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L589

But with copy variants on, we get a contiguous output from `permute_copy`, which causes us to hit the slower reduction:
https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cpu/SumKernel.cpp#L597

But the permute is actually unnecessary, we can just statically turn the graph into this to ensure that the fast kernel is hit with copy variants on:
```
z = torch.sum(x, dim=1)
```
ghstack-source-id: 152003888

Reviewed By: navahgar

Differential Revision: D34992319

fbshipit-source-id: 0baf493708ee2180c899814a954d220d88ba1d4f
(cherry picked from commit 797b6beb26325c56012e406e14fe211c0b5d744d)
2022-03-23 23:00:14 +00:00
Don Jang
6294a2eb7f [Static Runtime] Add out variant wrapper for aten::index_select (#74321)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/74321

This change adds out variant wrapper for aten::index_select.

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34928012

fbshipit-source-id: d808363d740d79fa25abee4dd33920fbb6ec7283
(cherry picked from commit ba9b3c0cd4ba240c4a2174f3376580a1880b2b4a)
2022-03-16 23:43:21 +00:00
Mike Iovine
f14a0be302 [SR] Avoid allocating rstd/mean in layer_norm (#73606)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73606

The single-output overload of `layer_norm` internally allocates two tensors. As an optimization, we previously added `static_runtime::layer_norm`. This variant of layer norm had two extra outputs to make the memory planner aware of these extra tensors. But these outputs were unused; it's actually better for us to avoid the allocation and associated computations entirely.
ghstack-source-id: 151394116

Test Plan: Existing unit tests

Reviewed By: hlu1

Differential Revision: D34562131

fbshipit-source-id: c6a6560e60db43b0b100aedc54ea4265acb347de
(cherry picked from commit 3bed52b6f688b93b9b032c3d2b4be68d08d8eb76)
2022-03-15 22:07:11 +00:00
Don Jang
381c0c080f [Static Runtime] Fix a bug that aten::full reuses a tensor that does not match requested one (#73990)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73990

This change fixes a bug that `aten::full` reuses a previously allocated tensor that does not match requested one when arguments to `aten::full` are dynamically changed.

This fix is applied to multiple other out variant wrappers added to Static Runtime, and their fixes are following.

Test Plan: - Added a unittest.

Reviewed By: mikeiovine

Differential Revision: D34768718

fbshipit-source-id: b6958d6601d36253dd5d4f93596fb14055cca9c9
(cherry picked from commit 42acb40d3a1e9359c0f1a3c25481854e5ad344b6)
2022-03-15 16:13:52 +00:00
Don Jang
1b80f609b0 [Static Runtime] Add out variant wrapper for aten::ones_like (#73945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73945

This change adds add out variant wrapper for aten::ones_like.

Test Plan:
- Added a unittest.
- Checked that the op execution got switched to its added out variant (P485330978).

Reviewed By: hlu1

Differential Revision: D34727057

fbshipit-source-id: 5022a7f547d53b0c00459d3959ad3c6e6a8a62d5
(cherry picked from commit 1bec4680e8173654400b165d720a0902136dba0f)
2022-03-14 20:29:58 +00:00
Don Jang
60f22a40ef [Static Runtime] Add out variant wrapper for aten::zeros (#73946)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73946

This change adds an out variant wrapper for aten::zeros.

Test Plan:
- Added a unittest.

- Confirmed that the added out variant gets executed by the unittest (P485324923).

Reviewed By: mikeiovine

Differential Revision: D34725843

fbshipit-source-id: 3ac02ba1914c4a51969381e610d4243df65071ed
(cherry picked from commit 368836d51709b7f96c79114984a95606b29766b1)
2022-03-11 00:52:30 +00:00
Don Jang
87564a1bd7 [Static Runtime] Add native op support for aten::len (#73899)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73899

This change adds native op wrappers to Static Runtime as appears in JIT (https://www.internalfb.com/code/fbsource/[429d233b9beb5e6f60df7304b792e2ff332f6ecd]/fbcode/caffe2/torch/csrc/jit/runtime/register_prim_ops.cpp?lines=613 , search for "aten::len" in that file).

Test Plan: Added unittests, "StaticRuntime.LenWith*", and confirmed they are passing with `V0307 17:39:39.817956 3516654 impl.cpp:1792] Switch to native impl for node: %2 : int = aten::len(%input.1)` per added unittest: P485159811

Reviewed By: mikeiovine

Differential Revision: D34705231

fbshipit-source-id: 916b1f8bdbc92def07bc3f98ce1db22f0f5ce206
(cherry picked from commit 66d2bb9a0a294b55e1bc87ae33f5553b1460e74b)
2022-03-10 02:57:51 +00:00
Mike Iovine
97b20b9b50 [SR][easy] Stack/concat out variants do not segfault on empty inputs (#73704)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73704

Empty inputs are invalid for these ops. But while looking for optimizations, I noticed that these ops just segfault when that happens, which is not helpful for users. Added a check/error message.
ghstack-source-id: 150812721

Test Plan: New unit tests

Reviewed By: hlu1

Differential Revision: D34596954

fbshipit-source-id: 6b22a3a255273920210dcd41f54a9d238bbbcc14
(cherry picked from commit 9e950bfffef36c320638662bdb72f19eb805a228)
2022-03-09 00:55:57 +00:00
Don Jang
71961d37bb [Static Runtime] Add out variant wrapper for aten::ones (#73851)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73851

This change adds an out variant wrapper for aten::ones

Test Plan: Added a unittest

Reviewed By: mikeiovine

Differential Revision: D34557095

fbshipit-source-id: 0d2ac8d0ad6f73067e28c2cebd3b4a018a9b17ae
(cherry picked from commit cc1dda957b8c3acd71de3aa6054c11a9aab5cfa6)
2022-03-07 20:33:22 +00:00
Mike Iovine
818bf361b6 [SR] Fix a kwargs API default value bug (#73681)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73681

Static runtime is rejecting legal calls made with the kwargs API when there are parameters with default values.
ghstack-source-id: 150433627

Test Plan: Added unit test to cover this case

Reviewed By: navahgar, d1jang

Differential Revision: D34588804

fbshipit-source-id: 74d7ef5bee74f9d16b02b0c8ceda4285ea776755
(cherry picked from commit 9c3db19cb45f6022e646deeb1e8056daa04f363f)
2022-03-03 22:31:37 +00:00
Don Jang
bbc59ff2bf [Static Runtime] Introduce StaticNodeInfo to store ProcessedNode's data independent from runtime instances (#73536)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73536

Currently `StaticNodeInfo` class assumes 2 distinct roles that are not too obvious:

1) "template" that contains metadata of an actual executable node by runtime. owned by `StaticModule`

2) fully instanced ones that are owned by `StaticRuntime`.

We currently merge these two usecases into one class, that can be error-prone in case illegal copying happens uncontrollably. Currently, we only copy objects of kind (1) into objects of kind (2) when a `StaticRuntime` instance is created.

To address ths issue, this change introduces `StaticNodeInfo`, a separate class, to distinguishes the aforementioned two usecases in the code more clearly. With this `StaticNodeInfo` is for (1) and `ProcessedNode` is now for (2).

Test Plan: Existing tests

Reviewed By: mikeiovine

Differential Revision: D33985600

fbshipit-source-id: 0c79cea2bf982dd956a35f48eaf6027e5b6e390c
(cherry picked from commit 0d8acc4a2b6eeb3e4af3ad2c99f4cd667680f8df)
2022-03-02 22:33:32 +00:00
Don Jang
539acb29cd [Static Runtime] Fix a broken test & Add an out variant wrapper for mse_loss (#73574)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73574

T113070663 identified a test breakage in `StaticRuntime.autogen__convert_indices_from_csr_to_coo` using `mode/opt`. This change fixes it by using a correct test value.

By generating the out variants for Static Runtime, this change also includes an out variant wrapper for `mse_loss`.

**Generating out variants for Static Runtime**
```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2]  buck run //caffe2/torch/fb/jit:gen_static_runtime_ops
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value //fbcode.sanitizer='address-undefined-dev' (was 'thread')
  ... and 13 more. See logs for all changes
Parsing buck files: finished in 0.8 sec
Downloaded 14/25 artifacts, 159.45 Kbytes, 30.0% cache miss (for updated rules)
Building: finished in 1.6 sec (100%) 52/52 jobs, 35/52 updated
  Total time: 2.5 sec
BUILD SUCCEEDED
total grouped native ops: 1501
structured grouped native ops: 540
generated grouped native ops: 137
```

Test Plan:
Ran the broken test in `mode/opt` and confirmed that it passes now.

```
[djang@devbig024.ftw2 ~/fbsource/fbcode/caffe2] buck test mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --exact 'caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo' --run-disabled
Invalidating internal cached state: Buck configuration options changed between invocations. This may cause slower builds.
  Changed value //project.buck_out='buck-out/opt' (was 'buck-out/dev')
  ... and 307 more. See logs for all changes
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/toolchains/platform009/build/mesa/lib/libGL.so' in library rule 'fbsource//third-party/toolchains:opengl'
DEBUG: /data/users/djang/fbsource/tools/build_defs/fbcode_macros/build_defs/lib/cpp_common.bzl:287:14: Using disallowed linker flag 'arvr/third-party/freeglut/3.0.0/libs/x64-linux/libglut.a' in library rule 'fbsource//third-party/toolchains:GLUT'
I0301 08:28:08.884272 2239319 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000

I0301 08:30:14.751745 2261718 configeratorc.cpp:70] Attempting to get config buck/detectors/bypass_dirty_builds, timeout=10000

Parsing buck files: finished in 10.1 sec
Creating action graph: finished in 6.1 sec
[RE] Metadata: Session ID=[https://fburl.com/b/reSessionID-fa0ba93b-33a1-4e6f-88f8-9f508d2c27c3]
[RE] Waiting on 0 remote actions. Completed 247 actions remotely, action cache hit rate: 0.00%.
Downloaded 13000/17457 artifacts, 463.99 Mbytes, 2.6% cache miss (for updated rules)
Building: finished in 04:16.6 min (100%) 28628/28628 jobs, 28628/28628 updated
  Total time: 04:32.9 min
More details at https://www.internalfb.com/intern/buck/build/c774ff43-5311-49ce-a677-30e3f6afdad1
BUILD SUCCEEDED
Tpx test run coordinator for Facebook. See https://fburl.com/tpx for details.
Running with tpx session id: 16d9b24c-4a63-4671-84b5-690fac0ee086
Trace available for this run at /tmp/tpx-20220301-083049.472831-16d9b24c-4a63-4671-84b5-690fac0ee086/trace.log
RemoteExecution session id: reSessionID-16d9b24c-4a63-4671-84b5-690fac0ee086-tpx
Started reporting to test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
    ✓ ListingSuccess: caffe2/benchmarks/static_runtime:static_runtime_cpptest : 285 tests discovered (0.425)
    ✓ Pass: caffe2/benchmarks/static_runtime:static_runtime_cpptest - StaticRuntime.autogen__convert_indices_from_csr_to_coo (0.105)
Summary
  Pass: 1
  ListingSuccess: 1
If you need help understanding your runs, please follow the wiki: https://fburl.com/posting_in_tpx_users
Finished test run: https://www.internalfb.com/intern/testinfra/testrun/4503599719295685
```

Reviewed By: mikeiovine

Differential Revision: D34552645

fbshipit-source-id: 36f15b0f29edcb7deb71ba8a6f66ce2532bf7c82
(cherry picked from commit 2329afd8bfc89671cfbd864414e528241e7045fc)
2022-03-02 04:36:31 +00:00
Raghavan Raman
cfd92f2d59 [Static Runtime] Add test that runs NNC fused kernels in parallel (#73256)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73256

This adds a test that executes multiple Static Runtime instances in parallel
when each instances includes a fusion.
ghstack-source-id: 149787403

Test Plan:
```
buck run mode/dev-asan //caffe2/benchmarks/static_runtime:static_runtime_cpptest -- --gtest_filter=CpuFusion.ParallelRuntimes
```

The above test results in an error: P482317015 (when parts of the fix in D34287960 (6d33852685) are backed out)

Reviewed By: mikeiovine

Differential Revision: D34404127

fbshipit-source-id: 95a267e27d74584df90841fe496f909171136981
(cherry picked from commit 57d3ad9a46a24559f6d4f4097bd1b8e0b1f6b077)
2022-02-28 17:44:45 +00:00
Don Jang
fe7e1bd1ce [Static Runtime] Add auto-generated out variant dispatchers (#72603)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72603

This change adds out variant dispatchers generated by the previous diff.

The number of the out variant dispatchers generated by this diff is 133, which increases the out variant coverage by 309% (current: 43, this diff: 133 + 43 = 176). This number is expected to increase a lot as we develop this script further to cover more ops.

Test Plan:
**Unittest**
Confirmed
```
buck run //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```
is passing.

Reviewed By: swolchok

Differential Revision: D33373928

fbshipit-source-id: 4d94d788282f3f313bb36f2f9452edecd9862246
(cherry picked from commit e4ce8b386d1fcc47b86cb9c9016a70e7a31b452c)
2022-02-28 08:39:10 +00:00
Mike Iovine
d398d4d32c [SR] Disable aten::where out variant (#73367)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73367

The op is currently bugged w.r.t. a `condition` input that is not the same shape as the others:

```
def forward(self, cond_1d, x, y):
    shape = [-1] + [1] * (x.dim() - 1)
    cond = cond_1d.view(shape)
    return torch.where(cond, x, y).clone()

Condition:
01
00
[ CPUBoolType{2} ]

A:
06 -9
08 -8
[ CPULongType{2,2} ]

B:
-4 05
-5 -2
[ CPULongType{2,2} ]

Actual:
06 05
-5 -2
[ CPULongType{2,2} ]

Expected:
06 -9
-5 -2
[ CPULongType{2,2} ]
```
ghstack-source-id: 149963254

Test Plan: Unit tests exercise broadcasting

Reviewed By: d1jang

Differential Revision: D34454770

fbshipit-source-id: 6ad4c4ca6893d2b87852a17d437437d99ca94ab4
(cherry picked from commit 7135bc40e9fd930c08f5291b7d6b4902ec30005b)
2022-02-26 01:08:45 +00:00
Raghavan Raman
4838c6dca0 [Static Runtime] Enable all tests to run with TensorExpr fuser (#73263)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/73263

ghstack-source-id: 149784887

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest
```

Reviewed By: d1jang

Differential Revision: D34405943

fbshipit-source-id: 63e345be4bf7a57bf4e1446074e5112d4ed68515
(cherry picked from commit 69a28a6b4ea53bcd88a51c4c36d5205577d84da3)
2022-02-24 00:34:34 +00:00
Raghavan Raman
02afdd54b9 [Static Runtime] Handle fallback graphs that are generated as part of the TE Fuser (#72945)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/72945

ghstack-source-id: 149429754

Test Plan:
```
buck run mode/opt //caffe2/benchmarks/static_runtime:static_runtime_cpptest — --gtest_filter=CpuFusion.FallbackGraph
```

Reviewed By: mikeiovine

Differential Revision: D34283840

fbshipit-source-id: 868bd340a50fe691797164524f2400d07998d304
(cherry picked from commit 80f60f2cc098e0132ececb321a35a1d3132fe676)
2022-02-18 18:34:50 +00:00