Summary:
## Motivation
* Make C2 ops cancellable so we can safely exit.
* Some C2 operators are blocking and thus non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
* When an error occurs in a net, or the net is cancelled, running ops have their
`Cancel` method called.
* This diff adds a `Cancel` method to `SafeEnqueueBlobsOp`
and `SafeDequeueBlobsOp` that calls `queue->close()` to force all
blocking ops to return.
* Adds a unit test that verifies the error propagation.
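The close-to-unblock pattern can be sketched in pure Python (a stand-in for the C++ BlobsQueue used here, not the actual implementation; all names are illustrative):

```python
import threading
from collections import deque

class ClosableQueue:
    """Sketch of the pattern this diff relies on: close() wakes every
    blocked producer/consumer so a cancelled op returns instead of
    blocking forever."""

    def __init__(self, capacity):
        self._items = deque()
        self._capacity = capacity
        self._closed = False
        self._cond = threading.Condition()

    def enqueue(self, item):
        with self._cond:
            while len(self._items) >= self._capacity and not self._closed:
                self._cond.wait()
            if self._closed:
                return False  # mirrors the op reporting a failed status
            self._items.append(item)
            self._cond.notify_all()
            return True

    def dequeue(self):
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            if not self._items:
                return None  # queue closed and drained
            item = self._items.popleft()
            self._cond.notify_all()
            return item

    def close(self):
        # What Cancel() triggers: wake all blocked ops so they can return.
        with self._cond:
            self._closed = True
            self._cond.notify_all()
```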
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44495
Test Plan:
## Unit Test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
```
Reviewed By: dzhulgakov
Differential Revision: D23236088
Pulled By: dahsh
fbshipit-source-id: daa90d9ee32483fb51195e269a52cf5987bb0a5a
Summary:
Make `gcs_cuda_only` and `gcs_gpu_only` return empty device lists if CUDA/GPU (CUDA or ROCm) is not available
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44578
Reviewed By: walterddr
Differential Revision: D23664227
Pulled By: malfet
fbshipit-source-id: 176b5d964c0b02b8379777cd9a38698c11818690
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44540
Support fp16 as the output type for UniformFill
Reviewed By: jianyuh
Differential Revision: D23558030
fbshipit-source-id: 53a5b2c92cfe78cd11f55e6ee498e1bd682fe4a1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44089
Add support for fp16 as an input type in the SparseLengthSum/Mean caffe2 operator
Reviewed By: xianjiec
Differential Revision: D23436877
fbshipit-source-id: 02fbef2fde17d4b0abea9ca5d17a36aa989f98a0
Summary:
Expose the `nesterov` option of the SGD optimizer from caffe2 to dper.
The dper sgd optimizer (https://fburl.com/diffusion/chpobg0h) already refers to the NAG sgd optimizer in caffe2 (https://fburl.com/diffusion/uat2lnan), so we only need to add the `nesterov` parameter to the dper sgd optimizer.
Analysis of run results: N345540.
- train_ne increases as momentum (m) decreases.
- for m=0.95 and m=0.9: eval_ne is lower with NAG than in production (no NAG, m=0.95).
- for m=0.99: eval_ne with or without NAG is higher than in production, indicating larger variance in validation and overfitting in training (lower train_ne).
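For reference, a minimal sketch of the two updates the `nesterov` flag selects between (the common NAG formulation; the exact caffe2 kernel may differ in its momentum convention):

```python
def sgd_momentum_step(w, grad, v, lr=0.1, momentum=0.9, nesterov=False):
    """One SGD step on a scalar weight.

    With nesterov=False this is classical heavy-ball momentum; with
    nesterov=True it is the NAG-style update used by common frameworks.
    Illustrative only -- not the actual caffe2 implementation.
    """
    v_new = momentum * v + grad
    if nesterov:
        # Look-ahead: apply the gradient plus the momentum-discounted
        # new velocity.
        w_new = w - lr * (grad + momentum * v_new)
    else:
        w_new = w - lr * v_new
    return w_new, v_new
```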
Test Plan:
1. Unit tests:
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_without_nesterov`
`buck test caffe2/caffe2/fb/dper/layer_models/tests/split_1:sparse_nn_test -- test_sgd_with_nesterov`
2. Build the dper front end package: `flow-cli canary ads.dper3.workflows.sparse_nn.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/2a368b55-d94b-45c1-8617-2753fbce994b. Flow package version is ads_dper3.canary:856b545cc6b249c0bd328f845adeb0d2.
3. Build the dper back end package: `flow-cli canary dper.workflows.dper3.train --mode opt --entitlement ads_global --run-as-secure-group team_ads_ml_ranking`. The build result (refreshed) is here: https://www.internalfb.com/intern/buck/build/70fa91cd-bf6e-4a08-8a4d-41e41a77fb52. Flow package version is aml.dper2.canary:84123a34be914dfe86b1ffd9925869de.
4. Compare prod with NAG-enabled runs:
a) refreshed prod run (m=0.95): f213877098
NAG-enabled run (m=0.95): f213887113
b) prod run (m=0.9): f214065288
NAG-enabled run (m=0.9): f214066319
c) prod run (m=0.99): f214065804
NAG-enabled run (m=0.99): f214066725
d) changed the data type of `nesterov` to `bool` and launched a validation run
NAG-enabled (m=0.95): f214500597
Reviewed By: ustctf
Differential Revision: D23152229
fbshipit-source-id: 61703ef6b4e72277f4c73171640fb8afc6d31f3c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/44043
To invoke `cancel` from the net instance in Python, we expose it through pybind state.
Reviewed By: dzhulgakov
Differential Revision: D23249660
fbshipit-source-id: 45a1e9062dca811746fcf2e5e42199da8f76bb54
Summary: Exports the operator to PyTorch, to be made into a low-level module.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_learning_rate
```
Reviewed By: yf225
Differential Revision: D23545582
fbshipit-source-id: 6b6d9aa6a47b2802ccef0f87c1263c6cc2d2fdf6
Summary: Integrate aot flow with model exporter.
Test Plan:
buck test dper3/dper3_backend/delivery/tests:dper3_model_export_test
replayer test see D23407733
Reviewed By: ipiszy
Differential Revision: D23313689
fbshipit-source-id: 39ae8d578ed28ddd6510db959b65974a5ff62888
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43938
resubmit
Test Plan: unit test included
Reviewed By: mruberry
Differential Revision: D23443493
fbshipit-source-id: 7b68f8f7d1be58bee2154e9a498b5b6a09d11670
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43591
Using 50 randomized inputs instead of 100 doesn't change the balance that much but speeds up the test runtime
Test Plan: CI
Reviewed By: orionr, seemethere
Differential Revision: D23332393
fbshipit-source-id: 7a8ff9127ee3e045a83658a7a670a844f3862987
Summary:
Separate user embeddings and ad embeddings in blobsOrder. New order:
1. meta_net_def
2. preload_blobs
3. user_embeddings (embeddings in the remote request-only net)
4. ad_embeddings (embeddings in the other remote net)
Add a field requestOnlyEmbeddings to meta_net_def to record user_embeddings.
This is for flash verification.
Test Plan:
buck test dper3/dper3_backend/delivery/tests:blob_reorder_test
Run a flow with canary package f211282476
Check the net: n326826, request_only_embeddings are recorded as expected
Reviewed By: ipiszy
Differential Revision: D23008305
fbshipit-source-id: 9360ba3d078f205832821005e8f151b8314f0cf2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/43205
A number of tests that forward to `TestLoadSaveBase.load_save` are all marked as flaky because they regularly take much longer to start up than hypothesis' default deadline of 200ms. This diff fixes the problem by removing the deadline for `load_save`. This is alright, as these tests aren't meant to be testing the performance of these operators.
I would set the deadline to 60s if I could, but it appears that the caffe2 GitHub CI uses a different version of hypothesis that doesn't allow using `dateutil.timedelta`, so instead of trying to figure out an approach that works on both I've just removed the deadline entirely.
I've also tagged all existing tasks related to these failures.
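Illustrative only (the real change sits on `TestLoadSaveBase.load_save`; the test name below is hypothetical), disabling the hypothesis deadline looks like:

```python
from hypothesis import given, settings, strategies as st

# deadline=None disables hypothesis' per-example time limit (200ms by
# default), so a slow operator start-up no longer flakes the test.
@settings(deadline=None)
@given(x=st.integers())
def test_slow_setup(x):
    assert isinstance(x, int)
```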
Differential Revision: D23175752
fbshipit-source-id: 324f9ff034df1ac4874797f04f50067149a6ba48
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42927
Added fp16 fusion to the net transforms.
Refactored the transforms, as well as glow_transform, out of opt/custom so that the OSS builds pass.
Test Plan: added net runner tests for this
Reviewed By: yinghai
Differential Revision: D23080881
fbshipit-source-id: ee6451811fedfd07c6560c178229854bca29301f
Summary:
1. Fix an illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors to the SplitByLengths operator.
3. Add tests for the SplitByLengths operator in the CUDA context.
Example of the SplitByLengths operator processing a scaling lengths vector:
given value vector A = [1, 2, 3, 4, 5, 6] and lengths vector B = [1, 2],
after execution of the SplitByLengths operator
the output should be [1, 2] and [3, 4, 5, 6].
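The scaling behavior can be sketched in plain Python (a stand-in for the C++ operator, not the actual implementation):

```python
def split_by_lengths(values, lengths):
    """Sketch of SplitByLengths semantics with a scaling lengths vector.

    When sum(lengths) evenly divides len(values), each length is scaled
    by len(values) // sum(lengths) before slicing.
    """
    total = sum(lengths)
    assert len(values) % total == 0, "len(values) must be divisible by sum(lengths)"
    scale = len(values) // total
    out, start = [], 0
    for n in lengths:
        end = start + n * scale
        out.append(values[start:end])
        start = end
    return out
```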
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D23079841
fbshipit-source-id: 3700e7f2ee0a5a2791850071fdc16e5b054f8400
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42763
Add the fp16 fusions as net transforms:
- layernorm fused with mul+add
- swish int8
Test Plan: added unit test, ran flows
Reviewed By: yinghai
Differential Revision: D23002043
fbshipit-source-id: f0b13d51d68c240b05d2a237a7fb8273e996328b
Summary:
Enforce double type for the counter value in rowwise_counter.
**Context:**
The existing implementation uses float for the counter value, but due to the precision limit of a 32-bit float [1], we observed in earlier experiments that the counter cannot increment beyond 16777216.0 (i.e., 16777216.0 is its maximum value). We now enforce double type to avoid this issue.
[1] https://stackoverflow.com/questions/12596695/why-does-a-float-variable-stop-incrementing-at-16777216-in-c
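The precision limit is easy to reproduce by round-tripping through a 32-bit float (pure-Python illustration of why the counter sticks at 2**24):

```python
import struct

def to_f32(x):
    # Round-trip a Python float (double) through a 32-bit float to
    # simulate single-precision arithmetic.
    return struct.unpack('f', struct.pack('f', x))[0]

# float32 has a 24-bit significand, so above 2**24 = 16777216 the gap
# between representable values is 2: adding 1 rounds back to the same
# number, and the counter stops incrementing.
c = 16777216.0
assert to_f32(c + 1.0) == c   # float32: the increment is lost
assert (c + 1.0) > c          # double: the counter keeps counting
```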
Test Plan:
op test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python/operator_test(f0b0b48c)$ buck test :rowwise_counter_test
Trace available for this run at /tmp/testpilot.20200728-083200.729292.log
TestPilot test runner for Facebook. See https://fburl.com/testpilot for details.
Testpilot build revision cd2638f1f47250eac058b8c36561760027d16add fbpkg f88726c8ebde4ba288e1172a348c7f46 at Mon Jul 27 18:11:43 2020 by twsvcscm from /usr/local/fbprojects/packages/testinfra.testpilot/887/t.par
Discovering tests
Running 1 test
Started new test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - test_rowwise_counter (caffe2.caffe2.python.operator_test.rowwise_counter_test.TestRowWiseCounter) 0.265 1/1 (passed)
✓ caffe2/caffe2/python/operator_test:rowwise_counter_test - main 14.414 (passed)
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7881299364977047
Summary (total time 18.51s):
PASS: 2
FAIL: 0
SKIP: 0
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
optimizer test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/python(7d66fbb9)$ buck test :optimizer_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/7036874434841896
Summary (total time 64.87s):
PASS: 48
FAIL: 0
SKIP: 24
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestMomentumSgd)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestGFtrl)
caffe2/caffe2/python:optimizer_test - test_caffe2_cpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestSparseRAdam)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagradWithCounter)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestAdagrad)
caffe2/caffe2/python:optimizer_test - test_caffe2_gpu_vs_numpy (caffe2.caffe2.python.optimizer_test.TestYellowFin)
caffe2/caffe2/python:optimizer_test - testDense (caffe2.caffe2.python.optimizer_test.TestRowWiseAdagrad)
caffe2/caffe2/python:optimizer_test - testGPUDense (caffe2.caffe2.python.optimizer_test.TestFtrl)
caffe2/caffe2/python:optimizer_test - testSparse (caffe2.caffe2.python.optimizer_test.TestRmsProp)
...and 14 more not shown...
FATAL: 0
TIMEOUT: 0
OMIT: 0
```
param download test
```
ruixliu@devvm1997:~/fbsource/fbcode/caffe2/caffe2/fb/net_transforms/tests(7ef20a38)$ sudo buck test :param_download_test
Finished test run: https://our.intern.facebook.com/intern/testinfra/testrun/6473924481526935
```
e2e flow:
f208394929
f207991149
f207967273
ANP notebook to check the counter value loaded from the flows
https://fburl.com/anp/5fdcbnoi
screenshot of the loaded counter (note that counter max is larger than 16777216.0)
{F250926501}
Reviewed By: ellie-wen
Differential Revision: D22711514
fbshipit-source-id: 426fed7415270aa3f276dda8141907534734337f
Summary:
1. Fix an illegal memory access issue for the SplitByLengths operator in the CUDA context.
2. Add support for scaling lengths vectors to the SplitByLengths operator.
3. Add tests for the SplitByLengths operator in the CUDA context.
Example of the SplitByLengths operator processing a scaling lengths vector:
given value vector A = [1, 2, 3, 4, 5, 6] and lengths vector B = [1, 2],
after execution of the SplitByLengths operator
the output should be [1, 2] and [3, 4, 5, 6].
Test Plan: buck test mode/dev-nosan caffe2/caffe2/python/operator_test:concat_split_op_test
Reviewed By: kennyhorror
Differential Revision: D22780307
fbshipit-source-id: c5ca60ae16b24032cedfa045a421503b713daa6c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42249
The main change is to bring Caffe2's superior error messages for CUDA initialization into c10 and use them in all code paths.
Basic logic:
| Case | Call to device_count() | init_cuda, e.g. allocating tensor |
| -- | -- | -- |
| all good | non-zero | just works |
| no gpus | 0, no warning | throw exception with good message |
| driver issues | 0, produce warning | throw exception with good message |
| out of memory with ASAN | 0, produce warning| throw exception with ASAN message |
Previously, the error thrown from init_cuda was very generic and the ASAN warning (if any) was buried in the logs.
Other clean up changes:
* always cache device_count() in a static variable
* move all ASAN macros into c10
Test Plan:
Hard to unittest because of build modes. Verified manually that the behavior from the table above holds by running the following script in different modes (ASAN/no-ASAN, CUDA_VISIBLE_DEVICES=):
```
print('before import')
import torch
print('after import')
print('devices: ', torch.cuda.device_count())
x = torch.tensor([1,2,3])
print('tensor creation')
x = x.cuda()
print('moved to cuda')
```
Reviewed By: ngimel
Differential Revision: D22824329
fbshipit-source-id: 5314007313a3897fc955b02f8b21b661ae35fdf5
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42516
att. We need it for some scripts.
Reviewed By: houseroad
Differential Revision: D22918112
fbshipit-source-id: 8a1696ceeeda67a34114bc57cb52c925711cfb4c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42421
Previously, when doing onnxifi from Python, we could only feed shape info with float dtype and batch-based dim type. This diff removes that limitation and uses the TensorBoundShapes protobuf as a generic shape-info struct, making the onnxifi interface in Python more flexible.
Reviewed By: ChunliF
Differential Revision: D22889781
fbshipit-source-id: 1a89f3a68c215a0409738c425b4e0d0617d58245
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42381
Introduce new tag to support distributed hogwild.
Reviewed By: boryiingsu
Differential Revision: D20484099
fbshipit-source-id: 5973495589e0a7ab185d3867b37437aa747f408a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42380
[Caffe2] Remove explicitly divide by zero in SpatialBN training mode
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22873214
fbshipit-source-id: 70b505391b5db02b45fc46ecd7feb303e50c6280
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42219
Introduce a new extra info tag on the forward net for operators sharing the same input. The effect is that the auto-generated sum of gradients for that input will not follow the tags of the operators in the forward net. This allows more flexible device allocation.
Test Plan:
# unit test
`./buck-out/gen/caffe2/caffe2/python/core_gradients_test#binary.par -r testMultiUseInputAutoGenSumDevice`
Reviewed By: xianjiec, boryiingsu
Differential Revision: D22609080
fbshipit-source-id: d558145e5eb36295580a70e1ee3a822504dd439a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42151
Previously, our Caffe2 SpatialBN op implementation computed running_var incorrectly, without the unbiased coefficient. It should actually have failed the test, because the output differs from cuDNN's output; however, our tests were too weak to catch this bug. This diff fixes all of them.
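As a sketch of the fix (assuming the usual running-stats convention; the exact caffe2/cuDNN momentum convention and kernel may differ), the batch variance folded into running_var should be the unbiased estimate:

```python
def running_var_update(running_var, batch_var_biased, n, momentum=0.9):
    """Illustrative running_var update for batch norm training.

    batch_var_biased is the batch variance computed with 1/n; the
    unbiased estimate multiplies it by n/(n-1), which is what cuDNN
    folds into running_var. Names and convention are assumptions,
    not the actual caffe2 code.
    """
    unbiased = batch_var_biased * n / (n - 1)
    return momentum * running_var + (1.0 - momentum) * unbiased
```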
Test Plan: buck test mode/dev-nosan //caffe2/caffe2/python/operator_test:spatial_bn_op_test
Reviewed By: houseroad
Differential Revision: D22786127
fbshipit-source-id: db80becb67d60c44faae180c7e4257cb136a266d
Summary:
Found while trying to get the ROCm Caffe2 CI green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42168
Reviewed By: seemethere
Differential Revision: D22791879
Pulled By: malfet
fbshipit-source-id: 8f7ef9711bdc5941b2836e4c8943bb95c72ef8af
Summary:
Found while trying to get the ROCm Caffe2 job green
Pull Request resolved: https://github.com/pytorch/pytorch/pull/42169
Reviewed By: seemethere
Differential Revision: D22791896
Pulled By: malfet
fbshipit-source-id: 9df6233876aec5ead056365499bab970aa7e8bdc
Summary: we need this op to avoid splicing a dense tensor and then using the Mergesinglescaler op
Test Plan: integrated test with dper2
Differential Revision: D22677523
fbshipit-source-id: f4f9a1f06841b0906ec8cbb435482ae0a89e1721