Summary:
Since hashing is different.
This should be ready to commit now. Running ads nn canaries.
Differential Revision: D4264009
fbshipit-source-id: 3aa16b0c47c61f9a442b0375524c5f1580af5892
Summary: Make xray net_type configurable via a command line argument
Differential Revision: D4262076
fbshipit-source-id: e2ecb9cd5bee5d6aaebe0ea8d2d4d9b378058cba
Summary: This allows us to serialize things between MKLMemory and a TensorProto.
Reviewed By: dzhulgakov
Differential Revision: D4218044
fbshipit-source-id: 934181493b482cb259c17ff4b17008eac52fd885
Summary:
This example writes an LMDB database of image data and (random) labels. Then it reads them back using Caffe2's TensorProtosDBInput and validates that the checksums match. This example shows how to coerce image data into TensorProtos and be happy.
Previously there was no clear example of how to create databases for Caffe2.
Differential Revision: D4263614
fbshipit-source-id: 21e08066899095b4efcc2d23dbc3ede81e75914a
Summary: Switching to Pieter-MPI changed the way we set up the network between operators. For synchronizing parameters after a checkpoint load, we run a checkpoint_net that contained operators for creating the common world as well as broadcast operators. Unfortunately this fails when the checkpoint sync is done a second time, because we would create a duplicate common world. The solution is to split the common world op and the broadcast ops into an init net and the actual broadcasting net, and to run the init net only once. This problem did not arise in the Flow version since I did only one checkpoint load per operator (process).
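The init-once / broadcast-many split can be sketched roughly as follows. This is plain Python for illustration only, not the actual Caffe2 code: `CheckpointSync`, `run_net`, and the net names `init_net` / `broadcast_net` are hypothetical.

```python
class CheckpointSync:
    """Hypothetical sketch: a one-time init net plus a repeatable broadcast net."""

    def __init__(self, run_net):
        self._run_net = run_net      # callable that runs a net by name (assumed)
        self._initialized = False

    def sync(self):
        if not self._initialized:
            # The common world must be created only once; creating it again
            # is what made the second checkpoint sync fail.
            self._run_net("init_net")
            self._initialized = True
        # Broadcasting parameters is safe to repeat on every checkpoint load.
        self._run_net("broadcast_net")


calls = []
syncer = CheckpointSync(calls.append)
syncer.sync()  # first checkpoint load
syncer.sync()  # second checkpoint load
```

After two syncs, the init net has run once and the broadcast net twice.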
Differential Revision: D4251754
fbshipit-source-id: ba030579e651e529e29bbf2d27920075078d8ff9
Summary:
Disclaimer: this is really hacky
Continues a fix from D4218902. The root problem is that DPER builds the net incrementally and input_record doesn't support that properly. For now I just manipulate the input record directly. Alisson wants to fix it properly later by allowing set_input_record to accept a superset of the current record.
But it should unblock our experimentation.
I'm curious how it's going to look in dper_example world.
Reviewed By: azzolini
Differential Revision: D4255285
fbshipit-source-id: ff65b6f943d705a9b3399035597e2e8ded2e1ff3
Summary:
This adds support for automatic aggregation of sparse gradients. We simply concatenate indices and values (no attempt to deduplicate, since this is already done before feeding into the optimizer). This should support various cases (indices and/or values can be generated by one or more gradient ops, or gradient outputs can be directly passed from inputs).
I tried to minimize the code footprint, but I introduced SparseGradGenMeta because GradGenMeta didn't lend itself very well to be used with sparse gradients.
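As a rough illustration of the aggregation described above (a plain-Python sketch, not the actual implementation; `aggregate_sparse_grads` is a hypothetical name and the list-based representation is an assumption):

```python
def aggregate_sparse_grads(grads):
    """Concatenate (indices, values) pairs without deduplicating indices.

    `grads` is a list of (indices, values) tuples, where values[i] is the
    gradient row for indices[i]. Hypothetical helper, for illustration only.
    """
    all_indices = []
    all_values = []
    for indices, values in grads:
        if len(indices) != len(values):
            raise ValueError("indices and values must have the same length")
        all_indices.extend(indices)
        all_values.extend(values)
    return all_indices, all_values


g1 = ([0, 3], [[0.1, 0.2], [0.3, 0.4]])
g2 = ([3, 5], [[1.0, 1.0], [2.0, 2.0]])
indices, values = aggregate_sparse_grads([g1, g2])
# Index 3 appears twice in `indices`; duplicates are intentionally kept.
```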
Reviewed By: dzhulgakov
Differential Revision: D4219788
fbshipit-source-id: 1d074664cffd82a8764e4b1473ada6bc46e6c51a
Summary: adding more methods to the layer representation. The corresponding implementation in DPER is: https://fburl.com/563869364
Differential Revision: D4256583
fbshipit-source-id: 91326b7bb9e960a5bc70b5a13812fce90054eceb
Summary:
When refactoring data parallel model, the division of LR by number of devices was dropped, and thus we ended up effectively multiplying gradients by the number of devices. Thus, we need to scale the LR by 1/numgpus.
Created a test to confirm that data_parallel_model produces exactly the same results on different numbers of GPUs, given the same total batch size.
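A small worked example of why the 1/numgpus factor is needed. It assumes per-device mean gradients are summed across devices and that the batch is split evenly. The toy model and the names `mean_grad` and `step` are made up for illustration. Exact fractions are used so the results can be compared for equality.

```python
from fractions import Fraction


def mean_grad(w, batch):
    """Mean gradient of 0.5 * (w * x - y)**2 over a batch of (x, y) pairs."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def step(w, batch, base_lr, num_devices):
    """One SGD step with the batch split evenly across devices.

    Per-device mean gradients are summed (assumption), so the learning
    rate is scaled by 1 / num_devices to compensate.
    """
    shard = len(batch) // num_devices
    shards = [batch[i * shard:(i + 1) * shard] for i in range(num_devices)]
    summed = sum(mean_grad(w, s) for s in shards)
    return w - (base_lr / num_devices) * summed


batch = [(Fraction(x), Fraction(2 * x + 1)) for x in range(1, 9)]
w0, lr = Fraction(0), Fraction(1, 100)
results = [step(w0, batch, lr, n) for n in (1, 2, 4)]
```

With the scaling in place, the updated weight is the same for 1, 2, and 4 devices. Without it, the step size would grow with the device count.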
Reviewed By: prigoyal
Differential Revision: D4248907
fbshipit-source-id: af21ede113e6ac25f12c556de298cb18974548be
Summary: Basic ops to set/get/check/wait against a StoreHandler.
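To give a feel for the four operations, here is an in-memory Python stand-in. It is only a sketch: `InMemoryStore` is a hypothetical class, not the Caffe2 StoreHandler interface, and having `get` block until the key exists is an assumption.

```python
import threading


class InMemoryStore:
    """Hypothetical in-memory stand-in for a key/value store handler."""

    def __init__(self):
        self._data = {}
        self._cond = threading.Condition()

    def set(self, key, value):
        with self._cond:
            self._data[key] = value
            self._cond.notify_all()  # wake up any waiters

    def check(self, keys):
        """Return True if every key is present, without blocking."""
        with self._cond:
            return all(k in self._data for k in keys)

    def wait(self, keys, timeout=None):
        """Block until every key exists; return False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: all(k in self._data for k in keys), timeout
            )

    def get(self, key, timeout=None):
        """Block until the key exists (assumed behavior), then return it."""
        if not self.wait([key], timeout):
            raise TimeoutError(key)
        with self._cond:
            return self._data[key]


store = InMemoryStore()
writer = threading.Timer(0.05, store.set, args=("rank0/addr", "host-a"))
writer.start()
value = store.get("rank0/addr", timeout=5)  # blocks until the writer runs
writer.join()
```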
Differential Revision: D4248059
fbshipit-source-id: cc53061fcc13823d4b9eed6b7c1c346b9e8ec991
Summary:
Add store handler implementation backed by a Redis server.
This allows for easy rendezvous when participating machines have no
access to a shared filesystem.
Differential Revision: D4241715
fbshipit-source-id: 4ce881df3a96af24f7efbb02d1050b3b2b9bc3c0
Summary:
DPER has very strange python ops that play with Workspace - they are somewhat similar to LoadOp/SaveOp, so I guess the semantics is fine.
Thus it makes sense to allow python operators to receive workspace pointer similarly to regular Operators.
I couldn't find a better way to implement the optional argument than checking, on the Python side, the number of args the function accepts.
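A minimal sketch of the argument-count check, assuming the op function takes either (inputs, outputs) or (inputs, outputs, workspace). `call_python_op` and the two example ops are hypothetical. The actual code may count arguments differently.

```python
import inspect


def call_python_op(func, inputs, outputs, workspace):
    """Call `func`, passing `workspace` only if it accepts a third argument."""
    num_params = len(inspect.signature(func).parameters)
    if num_params == 2:
        return func(inputs, outputs)
    if num_params == 3:
        return func(inputs, outputs, workspace)
    raise TypeError("expected a function taking 2 or 3 arguments")


def plain_op(inputs, outputs):
    outputs.append(sum(inputs))


def workspace_op(inputs, outputs, workspace):
    # Uses the workspace (here just a dict) in addition to inputs/outputs.
    workspace["total"] = sum(inputs)


outs, ws = [], {}
call_python_op(plain_op, [1, 2, 3], outs, ws)
call_python_op(workspace_op, [1, 2, 3], outs, ws)
```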
Reviewed By: ajtulloch
Differential Revision: D4242943
fbshipit-source-id: d97d4227815b741c8f884cfe254b06d2b56b5a41
Summary:
One more small batch of CHECKs that were left in the C2 codebase. Most of the leftovers
should be in tests/GPU-only code.
Reviewed By: Yangqing
Differential Revision: D4243782
fbshipit-source-id: a4a03c116ea8ba16facd2efc135746d5921f19d5
Summary: This diff adds a header file for net_gpu.cc so that the AsyncDAGNet class can be used to create other derived classes.
Reviewed By: ajtulloch
Differential Revision: D4230046
fbshipit-source-id: 379c3ff7ebb7aeeb4294f39e6f5d1ecad48b92f0
Summary:
This makes sure that we get a useful CUDA error message in ASAN mode. Also
made an fb-specific task pass by explicitly marking it as not ASAN-able.
Reviewed By: dzhulgakov
Differential Revision: D4243471
fbshipit-source-id: 2ce303b97b3b4728c05575a8e7e21eb5960ecbc7
Summary:
Faster implementation of UniqueOp using google::dense_hash_map, as suggested by dzhulgakov. I haven't benchmarked it precisely but early measurements with my workflow show a significant speed bump (this operation went from using 20% of overall CPU time down to 7%).
I gated the implementation using the "engine" feature, to avoid adding sparsehash as a dependency to caffe2.
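For reference, the hash-map approach can be sketched in plain Python, with a dict standing in for google::dense_hash_map. `unique_with_remapping` is a hypothetical name, and the first-seen output ordering is an assumption that may not match the operator.

```python
def unique_with_remapping(values):
    """Return (uniques in first-seen order, index of each input into uniques)."""
    position = {}  # value -> index into `uniques`
    uniques = []
    remapping = []
    for v in values:
        idx = position.get(v)
        if idx is None:
            # First time we see this value: assign it the next slot.
            idx = len(uniques)
            position[v] = idx
            uniques.append(v)
        remapping.append(idx)
    return uniques, remapping


uniques, remapping = unique_with_remapping([7, 3, 7, 9, 3])
```

Each input element costs one expected constant-time hash lookup, which is where the speedup over a sort-based approach would come from.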
Reviewed By: dzhulgakov
Differential Revision: D4219768
fbshipit-source-id: 2f142981e772105b42fffa24afb199ef816f8e0c
Summary: I want to collect tensors over multiple batches, so this operation can help allocate enough memory from the beginning.
Reviewed By: dzhulgakov
Differential Revision: D4216198
fbshipit-source-id: e6b67cc7d80d71455487878da9b6b7a225035085
Summary: Used in the NNPreProc layers. It fails online training when there is an empty batch.
Reviewed By: dzhulgakov
Differential Revision: D4235498
fbshipit-source-id: bde00a011831762e44a3f9bf2190d4b241a06ccc