Summary:
Until now, DPer examples have been creating multiple copies of the transform config in the net
definition, which meant I hit the ProtoBuf limit (64MB) for certain Task requests (especially
visible because of the ValidationPipeline I was adding).
After this diff we store SigridTransforms in a single instance per machine for training
(or one instance per reader).
The difference in plan size for a simple SparseNN model is ~30 MB (even though the second model includes a validation plan as well).
TODO: Do similar logic for NNPreProc as well (it's also pretty large).
Reviewed By: dzhulgakov
Differential Revision: D4441441
fbshipit-source-id: 4452dd86a4dc49b2c7f5b7642f443aed5720b047
Summary:
Spatial Softmax allows specifying locations that are not counted toward the loss. If none of the locations are counted, this resulted in NaNs and headaches. This diff fixes that by explicitly handling these cases.
Also added an assertion on the label blob's dimension(0).
Created a new test as well.
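For illustration only, a hypothetical numpy sketch (not the operator code) of one way such NaNs can arise when no location is counted, and the kind of explicit handling the fix applies:
```
import numpy as np

def masked_mean_loss(per_location_loss, count_mask):
    counted = count_mask.sum()
    if counted == 0:
        # explicitly handle "nothing counted" instead of dividing by zero -> NaN
        return np.float32(0.0)
    return (per_location_loss * count_mask).sum() / counted
```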
Differential Revision: D4442939
fbshipit-source-id: 8641bfad2a994e517ca3eda39345380a6ca1ba50
Summary:
When testing the code, a couple of issues arose:
- we need a different name for the last layer than in the preprocessed model, otherwise a shape assertion is triggered
- preprocess_noaugmentation still needs to crop images larger than 227x227, otherwise things fail.
Reviewed By: viswanathgs
Differential Revision: D4442700
fbshipit-source-id: 05f54e7f17c266280f5ba5bb57af1721fe30df12
Summary:
It helps when developing scripts locally (outside of Flow): one doesn't have to rerun the script to catch an exception in the debugger or to add a print statement. (Flow does this kind of thing automatically.)
Usage example:
```
from caffe2.python import workspace

if __name__ == '__main__':
    workspace.GlobalInit(['caffe2', '--caffe2_log_level=2'])
    from caffe2.python.utils import DebugMode
    DebugMode.enable()
    DebugMode.run(main)
```
Reviewed By: Yangqing
Differential Revision: D4424096
fbshipit-source-id: 73f418c80f581820e70139df7e166981e4d8c55f
Summary:
Some tweaks, hopefully getting us to 0.98 MAP:
- no cropping for the test dataset (as per patrick)
- spatialBN momentum 0.1 (the default is 0.9)
Also added some additional logging, and reduced the frequency of running the test net and of logging.
Reviewed By: viswanathgs
Differential Revision: D4439790
fbshipit-source-id: 700705b811a5fc8c7139a265de96db646605ca5a
Summary:
In this diff:
[1] Change the output from generating all paths from the root to the labels to a TreeProto.
The TreeProto itself is required by inference, and we can use hsm_util to get the
paths from the TreeProto.
[2] Fix the hsm_util index assignment.
Differential Revision: D4416731
fbshipit-source-id: 657d8b9b4df6fa30c9f92d391cf7e07b5c5db1f8
Summary: Change label indices to be in the range [0, num_classes).
Differential Revision: D4416685
fbshipit-source-id: b16ca8539fd538ad62bf1298dbad3f1553956241
Summary:
Minor bug in D4426513 - the bias is always added as an input blob. Running it on xray throws "RuntimeError: [enforce fail at operator.cc:25] blob
!= nullptr. op Conv: Encountered a non-existing input blob:
caffe.SpatialConvolution_0_b"
Reviewed By: Yangqing
Differential Revision: D4429231
fbshipit-source-id: 0d3905ea6e87128ec1aa9d0f0a2f43126b1069b1
Summary:
Turns out xray models have some independent Scale layers (with bias) besides
the Conv-Scale pairs. We could still fuse these with the previous layers with some
work, but for simplicity, we emit a Mul op plus an Add op for the bias when needed.
We could revisit layer-fusion optimizations in the future once we have
something working for xray.
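A rough sketch of what such a translation could look like in Caffe2 (blob names, op order, and broadcast arguments here are illustrative assumptions, not the translator's actual output):
```
from caffe2.python import core

net = core.Net("scale_translation_sketch")
# y = x * scale, broadcasting the per-channel scale over an NCHW input
net.Mul(["x", "scale_w"], "y", broadcast=1, axis=1)
# y = y + bias, only emitted when the Scale layer actually has a bias term
net.Add(["y", "scale_b"], "y", broadcast=1, axis=1)
```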
Reviewed By: Yangqing
Differential Revision: D4427266
fbshipit-source-id: ef7d8677ccd7d10dbd20759eeed378d9bc4522d1
Summary: Now that we directly support group convolution, this will no longer be needed. I also took the chance to add dilated convolution and optional bias.
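For illustration, a Conv op using these arguments might look like the following minimal sketch (blob names and parameter values are made up):
```
from caffe2.python import core

net = core.Net("conv_sketch")
# grouped, dilated convolution; passing only two inputs (no bias blob) makes the bias optional
net.Conv(["data", "conv_w"], "conv_out",
         kernel=3, stride=1, pad=2, dilation=2, group=4)
```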
Reviewed By: prigoyal
Differential Revision: D4426513
fbshipit-source-id: eb2bb0aa619f8ff5f732512570f736bc59cd57dd
Summary:
This is a handy tool for amortizing expensive operators (e.g.
distributed communication, some heavier kernel launches, etc.) over a
lot of small blobs (e.g. all the biases in a network). We can coalesce
these small blobs in-place into a single blob, keep acting on them in
operators as if they were not coalesced (passing them as inputs to
operators, etc.), and then, for the heavier operators, just work on
the coalesced blob that contains all of these units.
I named it UnsafeCoalesce since it introduces blob aliasing, which
needs care for work like memory management and graph rewriting (as in
memonger, etc.).
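A minimal sketch of the intended usage (blob names, and the exact output arity of the op, are assumptions here):
```
from caffe2.python import core

net = core.Net("coalesce_sketch")
small_blobs = ["fc1_b", "fc2_b", "fc3_b"]   # e.g. all the biases in a network
# Re-emit the small blobs as aliases into one contiguous blob; lightweight ops keep
# using the individual blobs, while heavy ops (e.g. communication) use the big one.
net.UnsafeCoalesce(small_blobs, small_blobs + ["coalesced_biases"])
```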
Reviewed By: Yangqing
Differential Revision: D3557149
fbshipit-source-id: 09cff4459b84270fe9e1da3b4a168fd66d01f795
Summary: Failing fast instead of swallowing the bias term.
Differential Revision: D4419130
fbshipit-source-id: 98ce0af9a20adecfb027ffe8293ff69910873abc
Summary:
Simple tool, similar to caffe_translator_test.py, for converting models from Caffe to
Caffe2. There are a couple of issues that still need to be fixed, as mentioned in
https://our.intern.facebook.com/intern/tasks?t=15424761, especially related to
the 'legacy_pad' field in the conv op.
Differential Revision: D4407146
fbshipit-source-id: ec641f6d7e0cf6cdf2eca21f058b4451635d4a56
Summary: The data parallel model has a sanity check that ensures that operators' inputs/outputs do not cross device boundaries. This failed when the operator was a CPU-only operator (such as the new AccuracyOp version). This fixes that.
Reviewed By: prigoyal
Differential Revision: D4417841
fbshipit-source-id: 9bc4e7a2074a544ca4db69ecf24183bbd41f84ca
Summary: Github import didn't work and the manual import lost some files.
Reviewed By: Yangqing
Differential Revision: D4408509
fbshipit-source-id: ec8edb8c02876410f0ef212bde6847a7ba327fe4
Summary:
It looks like for types that are created directly through a type(...)
call, we don't store strong references anywhere. As a result,
a GC pass in Python might or might not clean up these classes depending on the
phase of the moon and other random things. This means that in some
cases simple layers such as a Relu might disappear.
cat_shame
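A minimal standalone sketch of the underlying Python behavior (nothing here is layers code; the class name is made up):
```
import gc
import weakref

# Classes created via type(...) are only kept alive by whoever references them.
ReluLike = type('ReluLike', (object,), {})
probe = weakref.ref(ReluLike)

registry = [ReluLike]   # the fix amounts to holding a strong reference somewhere
del ReluLike
gc.collect()

assert probe() is not None   # survives only because the registry still points at it
```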
Reviewed By: xianjiec
Differential Revision: D4396289
fbshipit-source-id: ba4e9b7ef54ee43349853b0acc3d3f40c74e4d73
Summary:
(Ignore the convolution-op related changes; they will be patched separately later.)
This diff includes work from the last few weeks:
- some refactoring of the flow ops
- no_bias setting
- MAP computation (instead of accuracy) for OC
- adaptive learning rate for Xray concepts
- various small bug fixes
Reviewed By: viswanathgs
Differential Revision: D4329500
fbshipit-source-id: 000d4fd22ec408af5290480c788eb86546bff52e
Summary: DivOp was missing a gradient for CUDA, so I implemented it. Also added an operator test.
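For reference, the elementwise quotient-rule math such a gradient kernel needs to compute (a plain numpy sketch, not the operator code):
```
import numpy as np

a = np.random.rand(4) + 1.0
b = np.random.rand(4) + 1.0
z = a / b
dz = np.ones_like(z)      # upstream gradient dL/dz
da = dz / b               # dL/da =  dL/dz * 1/b
db = -dz * z / b          # dL/db = -dL/dz * a / b**2
```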
Differential Revision: D4396638
fbshipit-source-id: 9949e47aa3735bb418a0db003e2b2f4896056a71
Summary:
This diff brings us roughly to par with Torch on ResNet memory usage. With batch size 32, ResNet-50 took 7497 MiB before and 5010 MiB after this diff. This will thus allow us to handle 64 images / GPU, or 256 images / 4 GPUs.
In addition, I added a special argument to DagNet that makes it run only one thread for the first iteration. This is needed because there are allocations during the first iteration's backward pass due to gradient sharing, and these would cause NCCL to deadlock.
Sharing gradient buffers requires inferring which gradients can share memory (i.e. that they are not used concurrently). The previous memonger code used a topological sort, but rbgirshick showed that it does not work with tree-like models. So I wrote a new optimization algorithm based on DFS. It takes about 0.25 secs / GPU on ResNet-50, so it is clearly fast enough.
Module data_parallel_model supports this feature natively.
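As a toy illustration of the general idea (a greedy lifetime-based sketch, not the DFS algorithm in this diff; the blob/op representation is made up): a blob may reuse a buffer once its previous owner has no remaining consumers.
```
def plan_sharing(ops):
    """ops: list of (inputs, outputs) pairs of blob names, in execution order."""
    last_use = {}
    for step, (inputs, _) in enumerate(ops):
        for blob in inputs:
            last_use[blob] = step
    storage, free = {}, []
    fresh = 0
    for step, (inputs, outputs) in enumerate(ops):
        for blob in outputs:
            if free:                          # reuse a buffer nobody reads anymore
                storage[blob] = free.pop()
            else:                             # otherwise allocate a fresh shared buffer
                storage[blob] = "shared_%d" % fresh
                fresh += 1
        for blob in inputs:
            if last_use[blob] == step and blob in storage:
                free.append(storage[blob])    # this blob has no remaining consumers
    return storage                            # blob name -> shared buffer name
```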
Reviewed By: prigoyal
Differential Revision: D4363209
fbshipit-source-id: 73b11e7610438098bb11bff0af8075ab0cf2c0f1
Summary:
Adds a thread pool for image decode, and optional GPU-based data conversion, mean subtraction and std division
Closes https://github.com/caffe2/caffe2/pull/56
Reviewed By: Yangqing
Differential Revision: D4341326
Pulled By: bwasti
fbshipit-source-id: 6485616ea7d212c7701274a40fae912db30dff4a
Summary:
This normalizes the sparse gradient so that the "effective learning rate" of each sparse parameter is NOT affected by the number of examples in a batch that "use" this sparse parameter.
Experiments show it helps convergence (about 0.1% better train NE): https://fburl.com/1230747813683956. It's not conclusive yet, and we still need to do more experiments, but this diff adds it as an option and does not change the default behavior, so we can get it in first.
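A hypothetical numpy sketch of the idea (not the actual operator): the gradient accumulated for each sparse row is divided by how many examples in the batch touched that row, so frequently hit rows don't receive larger effective updates.
```
import numpy as np

def normalize_sparse_grad(indices, grads):
    """indices: (N,) int array of row ids; grads: (N, D) per-example gradients."""
    counts = np.bincount(indices, minlength=indices.max() + 1).astype(grads.dtype)
    out = np.zeros((counts.shape[0], grads.shape[1]), dtype=grads.dtype)
    np.add.at(out, indices, grads)              # scatter-add per-example gradients
    nonzero = counts > 0
    out[nonzero] /= counts[nonzero, None]       # average instead of sum per row
    return out
```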
Differential Revision: D4367283
fbshipit-source-id: 49ea80dfa9ea776ff4160e220cf6c86593521607
Summary: This diff adds a gflag for specifying the path for htrace span log files. This flag is used by the net types `HTraceDAGNet` and `HTraceAsyncDAGNet`.
Differential Revision: D4366849
fbshipit-source-id: 56038d3d64a3fd5ab363feda86a19a6f2496971c
Summary:
Rewrite of D3993337 based on the new stack.
Compared to the old one, we need more readers to achieve the same speed. But so far the speed is the same, and the new bottleneck is the trainer's write bandwidth. Model quality is the same as the baseline.
Reviewed By: azzolini
Differential Revision: D4310803
fbshipit-source-id: 6d04ae8040c1ee7caa9aea5287f054e73fbe325a
Summary: As title. We want to have a request_only net which runs on user_only sparse features. Submitting to get early feedback.
Reviewed By: dzhulgakov
Differential Revision: D4282783
fbshipit-source-id: 71241bf5444550075884c788c2da4783659bc1e0
Summary: Recently a PR landed that removed the assertion raised when feeding float64 data to FeedBlob for GPUs and turned it into a warning. Thus the test that checked for that assertion started to fail. Removing it.
Reviewed By: Yangqing
Differential Revision: D4363780
fbshipit-source-id: d9e222c309302243138d4ff3c223c711a4d2052d
Summary:
I was testing the perf difference between naive group conv and cudnn group conv. I am doing no_bias conv, so I added support for that in the naive implementation.
Although it is deprecated, I thought it would be nice to keep things working in our code.
Differential Revision: D4363168
fbshipit-source-id: 29719013d79b449fd359884709c7a1195be51ae3
Summary: As per discussion in D4355529
Reviewed By: prigoyal
Differential Revision: D4362162
fbshipit-source-id: 795fcf1507235a7dc3c7a10b0453037936d057aa
Summary:
Essentially, when the number of pairs is around 1000, the positive samples in the list get a massive boost from all the negative examples. This diff normalizes the gradient and the loss by the number of pairs.
This diff also adds protection against NaN and more logging to help debugging.
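A rough numpy sketch of the normalization (the loss form and names are illustrative assumptions, not the actual ranking loss used here):
```
import numpy as np

def normalized_pairwise_loss(pos_scores, neg_scores):
    """pos_scores, neg_scores: 1-D numpy arrays of scores for one list."""
    diffs = pos_scores[:, None] - neg_scores[None, :]   # one entry per (positive, negative) pair
    per_pair = np.log1p(np.exp(-diffs))                 # e.g. a logistic pairwise loss
    return per_pair.sum() / max(per_pair.size, 1)       # divide by the number of pairs
```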
Reviewed By: kdub0
Differential Revision: D4359782
fbshipit-source-id: 7240344ddb1f2f670d1eec1b03e7f6e413f3dfcc
Summary:
It used to be that only the cudnn engine supported convolution without bias; now it should be
fully supported by any conv engine.
To ignore the bias, simply use a convolution op that has two inputs instead of
three. The gradient operator will automatically figure out that it does not
need to compute the bias gradient.
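A minimal sketch contrasting the two forms (blob names are made up):
```
from caffe2.python import core

net = core.Net("conv_bias_sketch")
net.Conv(["data", "w", "b"], "out_with_bias", kernel=3)  # three inputs: bias gradient is computed
net.Conv(["data", "w"], "out_no_bias", kernel=3)         # two inputs: bias is skipped entirely
```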
Reviewed By: prigoyal
Differential Revision: D4354183
fbshipit-source-id: cf71b6289a254d15a6a663a85df63fbbaec3702b