Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/49322
In some cases async execution might lose dependencies (Alias-like ops) or produce suboptimal scheduling when there is a choice of which parts to schedule first. An example of the latter can happen in ModelParallel training, where a copy can get lower priority than the rest of the execution on a given GPU, which causes other GPUs to starve.
This operator addresses these issues by introducing extra explicit dependencies between ops.
Test Plan:
Unit test.
E2E testing in future diffs.
Reviewed By: xianjiec
Differential Revision: D24933471
fbshipit-source-id: 1668994c7856d73926cde022378a99e1e8db3567
Summary:
+ Add ArgMin support to Caffe2 to PyTorch converter
+ Using hypothesis to parameterize different conditions for test
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: houseroad
Differential Revision: D25016203
fbshipit-source-id: 94489fcf1ed3183ec96f9796a5b4fb348fbde5bc
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48240
Adds the support for converting the SparseLengthsSum4BitRowwiseSparse operator from caffe2 to pytorch as a part of c2_pt_converter
Test Plan:
Added a unit test
buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Tests Passed :
https://our.intern.facebook.com/intern/testinfra/testrun/2251799856412296
Reviewed By: houseroad
Differential Revision: D25067833
fbshipit-source-id: 45cbc331ca35bee27e083714e65a1e87a2a2d2e0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48340
This changes the context managed classes from using a decorator to define them to using inheritance. Inheritance allows the python static type checking to work correctly.
```
@context.define_context()
class Bar(object): ...
@context.define_context(allow_default=True)
class Foo(object): ...
```
becomes
```
class Bar(context.Managed): ...
class Foo(context.DefaultManaged): ...
```
Behavior differences:
* arg_name has been removed since it's not used anywhere
* classes need to call `super()` in their `__enter__`/`__exit__` methods if they override them (none currently do)
This also defines a context.pyi file to add types for Python 3. Python 2 support should not be affected.
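A minimal sketch of the inheritance pattern, assuming a simplified per-class stack. The `Managed` and `DefaultManaged` bodies and the `Scope`/`Settings` subclasses below are illustrative stand-ins, not the actual `caffe2.python.context` implementation:

```python
class Managed:
    """Hypothetical base class: an instance is 'current' inside its with-block."""

    _stacks = {}  # one stack of active instances per subclass (not thread-safe)

    @classmethod
    def _stack(cls):
        return Managed._stacks.setdefault(cls, [])

    @classmethod
    def current(cls):
        stack = cls._stack()
        if not stack:
            raise RuntimeError("no active %s context" % cls.__name__)
        return stack[-1]

    def __enter__(self):
        type(self)._stack().append(self)
        return self

    def __exit__(self, *exc_info):
        type(self)._stack().pop()


class DefaultManaged(Managed):
    """Hypothetical variant: falls back to a default-constructed instance."""

    @classmethod
    def current(cls):
        stack = cls._stack()
        return stack[-1] if stack else cls()


class Scope(Managed):
    pass


class Settings(DefaultManaged):
    pass
```

Because `current()` is an ordinary classmethod on a base class, a static type checker can see it on every subclass, which a class decorator that injects methods at runtime does not allow as easily.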
Test Plan:
ci
buck test //caffe2/caffe2/python:context_test //caffe2/caffe2/python:checkpoint_test
Reviewed By: dongyuzheng
Differential Revision: D25133469
fbshipit-source-id: 16368bf723eeb6ce3308d6827f5ac5e955b4e29a
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48407
T79817692: Fused8BitRowwiseQuantizedToFloat operator support for c2_pt_converter.
Also refactored some repeated code from the existing test functions. (Initial commit only has refactoring.)
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: bugra
Differential Revision: D25069936
fbshipit-source-id: 72f6a845a1b4639b9542c6b230c8cd74b06bc5a0
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48404
On Bento this prints a lot of messages like the following (see N408483 if you're an internal user):
```
W1123 120952.322 schema.py:811] Scalar should be considered immutable. Only call Scalar.set() on newly created Scalar with unsafe=True. This will become an error soon.
```
It also ignores the log level I set globally. Removing this line unless it is really important.
Test Plan: build a local dper package and verify
Differential Revision: D25163808
fbshipit-source-id: 338d01c82b4e67269328bbeafc088987c4cbac75
Summary: is_external_input doesn't check if the lookup tables are valid. Calling .Proto() should invalidate all lookup tables and have them rebuilt on call to any methods depending on them. This adds this check to is_external_input.
Test Plan: internal unit tests
Reviewed By: dzhulgakov, esqu1
Differential Revision: D25100464
fbshipit-source-id: d792dec7e5aa9ffeafda88350e05cb757f4c4831
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47023
DeviceType pretty clearly only needs 1 byte. DeviceIndex only needs 1 byte given that machines don't have anywhere near 255 GPUs in them as far as I know.
ghstack-source-id: 116901430
Test Plan: Existing tests, added assertion to catch if my assumption about DeviceIndex is incorrect
Reviewed By: dzhulgakov
Differential Revision: D24605460
fbshipit-source-id: 7c9a89027fcf8eebd623b7cdbf6302162c981cd2
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47768
This stores the next ID for a given NextName(prefix, output_id) so repeated calls to NextName are significantly faster. This accounts for ~65% of time spent for large models.
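A minimal sketch of the caching idea, assuming names of the form `prefix_N`. The `NameAllocator` class is hypothetical and keys the cache by prefix only, whereas the change above keys it by `(prefix, output_id)`:

```python
class NameAllocator:
    """Hypothetical stand-in for the naming logic; not caffe2's actual code."""

    def __init__(self):
        self._used = set()   # every name handed out or reserved so far
        self._next_id = {}   # prefix -> next suffix worth trying

    def reserve(self, name):
        self._used.add(name)

    def next_name(self, prefix):
        # Resume from the cached suffix instead of rescanning from 0 each time.
        i = self._next_id.get(prefix, 0)
        while "%s_%d" % (prefix, i) in self._used:
            i += 1
        name = "%s_%d" % (prefix, i)
        self._used.add(name)
        self._next_id[prefix] = i + 1
        return name
```

Without the cache, the k-th call for a prefix has to probe k candidates, so k calls cost on the order of k^2 probes. With it, each call usually probes once.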
Test Plan:
buck test //caffe2/caffe2/python/...
will launch canary job before landing to ensure no regressions + confirm speedup
Reviewed By: dzhulgakov
Differential Revision: D24876961
fbshipit-source-id: 668d73060d800513bc72d7cd405a47d15c4acc34
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/48021
Extends the operator schema check from the simple memonger to the DAG memonger as well. As part of this, a fix handles in-place ops (ops with at least one output name equal to an input blob). Previously all output blobs of an op were treated as shareable, which tripped the assertion that external input blobs with the same name are not allowed to be shared.
Test Plan: Added corresponding unit tests
Reviewed By: hlu1
Differential Revision: D24968862
fbshipit-source-id: b6679a388a82b0d68f65ade64b85560354aaa3ef
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47718
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs.
As part of this change, I am also updating the DAG memonger traversal to always start from root ops, i.e. ops with in-degree 0. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw an assertion error here: https://fburl.com/diffusion/ob110s9z . It was throwing this assertion error for almost all distributed inference part0 nets.
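The root-first traversal can be sketched as a standard in-degree based (Kahn-style) ordering. This is an illustration, not the memonger code: the `(name, inputs, outputs)` tuples and the `root_first_order` helper are assumptions, and each blob is assumed to be written by at most one op.

```python
from collections import deque


def root_first_order(ops):
    """ops: list of (name, inputs, outputs). Returns op names, parents first."""
    producer = {}
    for name, _, outputs in ops:
        for blob in outputs:
            producer[blob] = name

    children = {name: [] for name, _, _ in ops}
    in_degree = {name: 0 for name, _, _ in ops}
    for name, inputs, _ in ops:
        # Blobs without a producer are external inputs and add no edge;
        # an in-place op is not treated as its own parent.
        parents = {producer[b] for b in inputs
                   if b in producer and producer[b] != name}
        for parent in parents:
            children[parent].append(name)
        in_degree[name] = len(parents)

    # Start only from root ops (in-degree 0), never from head blobs.
    queue = deque(name for name, _, _ in ops if in_degree[name] == 0)
    order = []
    while queue:
        name = queue.popleft()
        order.append(name)
        for child in children[name]:
            in_degree[child] -= 1
            if in_degree[child] == 0:
                queue.append(child)
    return order
```

In the usage below, the head input `x` feeds both the root op `fc` and the non-root op `join`. Starting from `x` could reach `join` before its parent, while starting from root ops cannot.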
Test Plan: Added corresponding tests in memonger_test.py . Could not find unit tests in c++ version of memonger.
Reviewed By: hlu1
Differential Revision: D24872010
fbshipit-source-id: 1dc99b2fb52b2bc692fa4fc0aff6b7e4c5e4f5b0
Summary: Added the MatMul operator for caffe2
Test Plan: buck test //caffe2/torch/fb/model_transform/c2_convert:c2_pt_converter_test
Reviewed By: bugra
Differential Revision: D24920937
fbshipit-source-id: 7ba09ba0439cb9bd15d6a41fd8ff1a86d8d11437
Summary: To support min/max/mean/std, SummarizeOp needs to skip size checking (similar to the LpNorm error mentioned above) and accept multiple types
Test Plan:
unit test:
`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test`
https://our.intern.facebook.com/intern/testinfra/testrun/1407375057859572
`buck test //caffe2/caffe2/fb/tensorboard/tests:tensorboard_accumulate_histogram_op_test --stress-runs 1000`
https://our.intern.facebook.com/intern/testinfra/testrun/2533274832166362
Reviewed By: cryptopic
Differential Revision: D24605507
fbshipit-source-id: fa08372d7c9970083c38abd432d4c86e84fb10e0
Summary:
Distributed Inference splits a predict net into multiple parts, part0 being the main part which contains ops to make remote calls to other parts. part0 predict net may contain AsyncIf ops to optimize rpc call usage. AsyncIf ops have internal nets which may refer to memongered blobs. This change handles AsyncIf ops to update internal nets to refer to memongered blobs. Here is one reference part0 predict net with AsyncIf ops: https://www.internalfb.com/intern/paste/P145812115/
As part of this change, I am also updating the DAG memonger traversal to always start from root ops, i.e. ops with in-degree 0. The earlier logic started traversing ops based on input head blobs, and if one of the head inputs was used in a non-root op that got visited before its parent, the traversal would throw an assertion error here: https://fburl.com/diffusion/ob110s9z . It was throwing this assertion error for almost all distributed inference part0 nets.
Reviewed By: hlu1
Differential Revision: D24346771
fbshipit-source-id: ad2dd2e63f3e822ad172682f6d63f8474492255d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47541
The profiler has guided us to `schema.py`. Since these `Field`s are used everywhere and in huge quantities, we can easily make some optimizations system wide by adding `__slots__`.
From StackOverflow, benefits include:
* faster attribute access.
* space savings in memory.
Read more: https://stackoverflow.com/a/28059785/
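A small illustration of what `__slots__` changes, using two hypothetical classes rather than the real `schema.Field`:

```python
class PlainField:
    def __init__(self, children):
        self.children = children
        self.parent = None


class SlottedField:
    # Fixed attribute layout: no per-instance __dict__ is allocated.
    __slots__ = ("children", "parent")

    def __init__(self, children):
        self.children = children
        self.parent = None
```

One trade-off: a slotted instance rejects attributes that are not declared. A subclass must declare its own `__slots__` too, otherwise its instances get a `__dict__` again.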
Reviewed By: dzhulgakov
Differential Revision: D24771078
fbshipit-source-id: 13f6064d367440069767131a433c820eabfe931b
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47542
The previous way of doing `Field.__init__(self, [])` is just wrong. Switching to Python2 compatible way: `super(ObjectName, self).__init__(...)`
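The form being switched to looks roughly like this. The class bodies are simplified placeholders, not the real `schema.py` definitions:

```python
class Field(object):
    def __init__(self, children):
        self.children = list(children)


class Scalar(Field):
    def __init__(self, dtype=None):
        # Python 2 compatible spelling; follows the MRO rather than
        # hard-coding the base class as Field.__init__(self, []) does.
        super(Scalar, self).__init__([])
        self.dtype = dtype
```

In Python 3 only code, the bare `super().__init__([])` is equivalent.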
Reviewed By: dzhulgakov
Differential Revision: D24771077
fbshipit-source-id: d6798c72090c0264b6c583602cae441a1b14587c
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47530
`Net.AddExternalInput` should raise if there are duplicate names. The previous code would only raise if the addition of duplicates was in separate calls, but not if it was in the same call.
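The difference between the two checks can be sketched as follows. The `add_external_inputs` helper and the choice of `ValueError` are assumptions for illustration and are not taken from `core.Net`:

```python
def add_external_inputs(existing, *names):
    """Hypothetical helper: add names to the set `existing`, rejecting duplicates."""
    seen = set(existing)
    for name in names:
        # Checking against `seen` (not just `existing`) also catches a name
        # that is repeated within this same call.
        if name in seen:
            raise ValueError("duplicate external input: %s" % name)
        seen.add(name)
    # Only mutate the caller's set once every name has been validated.
    existing.update(names)
```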
Test Plan:
Added two new regression tests
```
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.622)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicate (caffe2.caffe2.python.core_test.TestExternalInputs) (9.639)
✓ Pass: caffe2/caffe2/python:core_test - testSetInputRecordWithoutBlobs (caffe2.caffe2.python.core_test.TestExternalInputs) (9.883)
✓ Pass: caffe2/caffe2/python:core_test - testAddExternalInputShouldRaiseIfDuplicateInSameCall (caffe2.caffe2.python.core_test.TestExternalInputs) (10.153)
```
Test trained 2 models. No issues
f230755456
f230754926
Reviewed By: dzhulgakov
Differential Revision: D24763586
fbshipit-source-id: c87088441d76f7198f8b07508b2607aec13521ed
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47512
I deleted the last line of `__init__` -- `self._field_offsets.append(offset)` -- and the unittests didn't fail.
So this diff is to improve test coverage.
Test Plan:
```
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetEmptyParent (caffe2.caffe2.python.schema_test.TestField) (8.225)
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsetsIfNoChildren (caffe2.caffe2.python.schema_test.TestField) (8.339)
✓ Pass: caffe2/caffe2/python:schema_test - testInitShouldSetFieldOffsets (caffe2.caffe2.python.schema_test.TestField) (8.381)
```
Reviewed By: dzhulgakov
Differential Revision: D24767188
fbshipit-source-id: b6ce8cc96ecc61768b55360e0238f7317a2f18ea
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/47475
This improves the core.Net cloning/init performance by quite a bit. It makes set_input_record run in linear time instead of quadratic time (O(n^2)) by checking the external_input map instead of regenerating the external inputs each time and then iterating over them.
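A toy comparison of the two lookup strategies, using hypothetical helper names and plain Python containers rather than the actual `core.Net` internals:

```python
def missing_blobs_rescan(record_blobs, external_inputs):
    # Rebuilds and scans the whole input list once per blob, so the cost
    # grows with len(record_blobs) * len(external_inputs).
    return [b for b in record_blobs if b not in list(external_inputs)]


def missing_blobs_lookup(record_blobs, external_input_set):
    # One average-case constant-time hash lookup per blob.
    return [b for b in record_blobs if b not in external_input_set]
```

Both return the same answer; only the amount of work per blob differs.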
Test Plan: unit tests + canary runs
Reviewed By: dzhulgakov
Differential Revision: D24765346
fbshipit-source-id: 92d9f6dec158512bd50513b78675174686f0f411
Summary:
Add `last_n_window_collector`, since C2 supports this operator and PyTorch currently does not: https://www.internalfb.com/intern/diffusion/FBS/browsefile/master/fbcode/caffe2/caffe2/operators/last_n_window_collector.cc?lines=139
## Problem that we are solving
This operator works on multiple pieces of data and collects the last `n` elements that have been seen.
If the following pieces of data are passed through:
```
[1, 2, 3, 4]
[5, 6, 7]
[8, 9, 10, 11]
```
in 3 passes, and the collector size is 6, the expected result is:
```
[6, 7, 8, 9, 10, 11]
```
In other words, we need something like a FIFO (first in, first out) mechanism: as data passes through the collector, new data is pushed at the end and the oldest data is dropped.
In this particular example, in the first pass (the data is `[1, 2, 3, 4]`), we hold `[1, 2, 3, 4]` in the queue, as our queue size is 6.
In the second pass (the data is `[5, 6, 7]`), we hold `[2, 3, 4, 5, 6, 7]` in the queue; since 1 was inserted first, it is dropped due to the size limit of the queue.
In the third pass (the data is `[8, 9, 10, 11]`), we hold `[6, 7, 8, 9, 10, 11]` in the queue, and `2, 3, 4, 5` are dropped due to the size of the queue.
For the multidimensional case, when we have the following data:
```
[[1, 2], [2, 3], [3, 4], [4, 5]]
[[5, 6], [6, 7], [7, 8]]
[[8, 9], [9, 10], [10, 11], [11, 12]]
```
and our queue size is 6.
In the first pass, we will have `[[1, 2], [2, 3], [3, 4], [4, 5]]`
In the second pass, we will have `[[2, 3], [3, 4], [4, 5], [5, 6], [6, 7], [7, 8]]`
In the third pass, we will have `[[6, 7], [7, 8], [8, 9], [9, 10], [10, 11], [11, 12]]`
### The implementation
I am using the FIFO queue from Python's `collections` library (`deque`). It accepts a `maxlen` argument, which sets the size of the queue.
I take the last n entries of the tensor through indexing, and this operator does not copy the data.
The test plan covers both single-dimension and multi-dimension tensors.
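The windowing behavior described above can be reproduced with `collections.deque` on plain Python lists. This is a simplified sketch with a hypothetical `last_n_window` helper, not the tensor-based operator itself:

```python
from collections import deque


def last_n_window(batches, n):
    """Feed batches through a bounded FIFO and return what remains."""
    window = deque(maxlen=n)
    for batch in batches:
        # Once the deque is full, each append drops the oldest item on the left.
        window.extend(batch)
    return list(window)
```

For the multidimensional case each row is treated as one queue entry, so the same helper works when the batches are lists of rows.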
### Benchmark
I used various configurations and added a benchmark test. The PyTorch implementation is much faster than the Caffe2 implementation:
#### CPU Benchmark
```
torch_response.median
0.00019254473969340324
caffe_response.median
0.00030233583599794657
```
#### GPU Benchmark
```
torch_response.mean
0.000081007429903838786
caffe_response.mean
0.00010279081099724863
```
Test Plan:
### For CPU:
```
buck test //caffe2/torch/fb/sparsenn:test
```
### For GPU:
- Used an on-demand machine and ran the following commands:
```
jf get D24435544
buck test mode/opt //caffe2/torch/fb/sparsenn:test
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/4222124688138052/
Reviewed By: dzhulgakov, radkris-git
Differential Revision: D24435544
fbshipit-source-id: 8193b4746b20f2a4920fd4d41271341045cdcee1
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46590
This operator is very similar to LengthsToRanges but doesn't pack the offsets next to the original lengths.
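The relationship between the two outputs can be sketched in plain Python. The function names are illustrative and the real operators work on tensors:

```python
from itertools import accumulate


def lengths_to_offsets(lengths):
    """Exclusive prefix sum: the start offset of each segment."""
    return [end - n for end, n in zip(accumulate(lengths), lengths)]


def lengths_to_ranges(lengths):
    """(offset, length) pairs, i.e. offsets packed next to the lengths."""
    return list(zip(lengths_to_offsets(lengths), lengths))
```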
Reviewed By: yf225
Differential Revision: D24419746
fbshipit-source-id: aa8b014588bb22eced324853c545f8684086c4e4
Summary: I was reading/looking into how LocalSession works and realized that the workspace type being passed around was the bound function on TaskGroup instead of the actual type. This meant that all workspaces for localsession would always be global, because they'd never match the private workspace type.
Test Plan: <not sure, could use some suggestions>
Reviewed By: cryptopic
Differential Revision: D24458428
fbshipit-source-id: 0f87874babe9c1ddff25b5363b443f9ca37e03c1
Summary:
We've been seeing a lot of Hypothesis timeouts, and from profiling a few of the failing tests, one of the contributing factors is a really slow grad checker. In short, it launches the whole op for each of the input elements, so the overall complexity is at least O(numel^2).
This applies a very unscientific hack to just run grad check on the first and last few elements. It's not ideal, but it's better than flaky tests. One can still explicitly opt in with the env var.
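A rough sketch of the idea on flat Python lists. The helper names and the default of three elements per edge are assumptions for illustration, not the values used by the actual gradient checker:

```python
def indices_to_check(numel, edge=3):
    """Hypothetical helper: first and last `edge` flat indices, no duplicates."""
    if numel <= 2 * edge:
        return list(range(numel))
    return list(range(edge)) + list(range(numel - edge, numel))


def numeric_grad(f, x, i, eps=1e-6):
    """Central-difference estimate of df/dx[i]; evaluates f twice per element."""
    hi = list(x)
    hi[i] += eps
    lo = list(x)
    lo[i] -= eps
    return (f(hi) - f(lo)) / (2 * eps)
```

Since every checked element costs two full evaluations of `f`, capping the number of checked indices turns the quadratic cost into one that is linear in the input size.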
Reviewed By: malfet
Differential Revision: D23336220
fbshipit-source-id: f04d8d43c6aa1590c2f3e72fc7ccc6aa674e49d2
Summary: Similar to If operator, AsyncIf also contains nets in args. It needs the same handling.
Test Plan:
New unit test test_control_op_remap
`buck test caffe2/caffe2/python:core_test`
Also it worked end to end in prototype of dist bulk eval workflow f226680903
Reviewed By: yyetim
Differential Revision: D24451775
fbshipit-source-id: 50594e2ab9bb457329ed8da7b035f7409461b5f6
Summary:
Follow-up of https://github.com/pytorch/pytorch/issues/46461 with a similar goal
Makes them more readable and possibly faster. Care has to be taken because `list(map(...))` and list comprehensions apply the function immediately, while `(x for x in xs)` is a generator expression which gets evaluated later. This is a benefit in some cases where it is not required to actually create the list of values in memory (e.g. when passing to `tuple` or `extend` or `join`)
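The evaluation-order difference is easy to see with a function that records its calls. Note that in Python 3 a bare `map(...)` is itself lazy; it is `list(map(...))` and list comprehensions that run eagerly.

```python
calls = []


def record(x):
    calls.append(x)
    return x * 2


eager = [record(x) for x in (1, 2)]  # record() runs right away
after_list = list(calls)

lazy = (record(x) for x in (3, 4))   # nothing has run yet
after_gen = list(calls)

total = sum(lazy)                    # consumed here, no intermediate list
```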
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46462
Reviewed By: zou3519
Differential Revision: D24422343
Pulled By: ezyang
fbshipit-source-id: 252e33499c92ac0b15238f2df32681dbbda2b237
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/46457
Wanted to see if a CopyMatrix specialized for float that uses mkl_somatcopy could be faster, but it wasn't. Still checking in the benchmark so that it can be used later.
Test Plan: .
Reviewed By: dskhudia
Differential Revision: D24345901
fbshipit-source-id: d3e68dbb560e3138fda11c55789cd41bc0715c6d
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45551
The FP16 version of SparseNormalize op in Caffe2 is missing. This Diff adds FP16 support to unblock MC process of adding FP16 to Dper3.
Check https://fb.quip.com/L0T2AXGwUY3n#EReACAeifk3 .
One open question is whether the pure FP16 SparseNormalize op will affect accuracy. Maybe we should do it in the FP32 domain.
ghstack-source-id: 114184398
Test Plan:
```
buck run mode/opt //caffe2/caffe2/python/operator_test:sparse_normalize_test
```
```
buck run mode/opt -c python.package_style=inplace mode/no-gpu //caffe2/caffe2/python/benchmarks:sparse_normalize_benchmark -- --fp16
```
Reviewed By: jspark1105
Differential Revision: D24005618
fbshipit-source-id: 8b918ec4063fdaafa444779b95206ba2b7b38537
Summary: This diff adds a string equality checking operator.
Test Plan: Unit tests
Differential Revision: D24042344
fbshipit-source-id: c8997c6130e3438f2ae95dae69f76978e2e95527
Summary: `__repr__` calling self.tasks() ends up marking the instance as "used", which doesn't seem appropriate. I was debugging a value being passed around and then ran into `Cannot add Task to an already used TaskGroup.` because the value had been logged once.
Test Plan:
Added a unit test -- didn't see a clean public method to test it, but I'm happy to add one if that makes sense.
Will wait for sandcastle to trigger everything else; I'm not at all familiar with this code so any other recommendations would be great!
Reviewed By: cryptopic
Differential Revision: D23541198
fbshipit-source-id: 5d1ec674a1ddaedf113140133b90e0da6afa7270
Summary: Currently GetSingleArgument is overflowing since it's expecting an int instead of an int64 when using a 1cycle (hill policy) annealing schedule
Test Plan:
unittest
buck test caffe2/caffe2/python/operator_test:learning_rate_op_test
Differential Revision: D23938169
fbshipit-source-id: 20d65df800d7a0f1dd9520705af31f63ae716463
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45315
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45314
In D23858329 (721cfbf842), we put the PriorCorrectionCalibrationPrediction unit test in an OSS file, which causes test failures in public trunk.
This diff moves it to an FB-only test file.
Test Plan:
```
buck test //caffe2/caffe2/python/operator_test:torch_integration_test -- test_gather_ranges_to_dense_op
buck test //caffe2/caffe2/fb/python/operator_test:torch_integration_test -- test_prior_correct_calibration_prediction_op
```
all pass.
Reviewed By: houseroad
Differential Revision: D23899012
fbshipit-source-id: 1ed97d8702e2765991e6caf5695d4c49353dae82
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/45178
## Motivation
* To be able to make C2 ops cancellable so we can safely exit.
* Some C2 operators currently block and are therefore non-cancellable. If an error
occurs, we need to be able to safely stop all net execution so we can throw
the exception to the caller.
## Summary
* Adds a hypothesis test for queue ops cancellation.
Test Plan:
## Unit test added to verify that queue ops propagate errors
```
buck test caffe2/caffe2/python:hypothesis_test
buck test caffe2/caffe2/python:hypothesis_test -- test_safe_dequeue_blob__raises_exception_when_hang --stress-runs 1000
```
```
Summary
Pass: 1000
ListingSuccess: 1
```
Reviewed By: d4l3k
Differential Revision: D23847576
fbshipit-source-id: 2fc351e1ee13ea8b32d976216d2d01dfb6fcc1ad