pytorch/caffe2/python/layers/batch_normalization.py

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

from caffe2.python import schema
from caffe2.python.layers.layers import ModelLayer

import numpy as np


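class BatchNormalization(ModelLayer):
    """Batch normalization layer built on the SpatialBN operator.

    The input must be a schema.Scalar whose shape (batch dimension omitted)
    is either 1D, i.e. (C,), or 3D, i.e. (C, H, W) for order='NCHW' or
    (H, W, C) for order='NHWC'. 1D inputs are expanded to 4D before
    SpatialBN and squeezed back afterwards.
    """
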
    def __init__(
        self,
        model,
        input_record,
        name='batch_normalization',
        scale_optim=None,
        bias_optim=None,
        momentum=0.9,
        order='NCHW',
        scale_init_value=1.0,
        **kwargs
    ):
        super(BatchNormalization, self).__init__(
            model, name, input_record, **kwargs)

        assert isinstance(input_record, schema.Scalar), "Incorrect input type"

        self.input_shape = input_record.field_type().shape

        # input_shape omits the batch dimension, so a 4D tensor has 3 entries
        # and a 2D tensor has 1. input_dims is the number of channels.
        if len(self.input_shape) == 3:
            if order == "NCHW":
                input_dims = self.input_shape[0]
            elif order == "NHWC":
                input_dims = self.input_shape[2]
            else:
                raise ValueError("Please specify a correct order")
        else:
            assert len(self.input_shape) == 1, (
                "This layer supports only 4D or 2D tensors")
            input_dims = self.input_shape[0]

        self.output_schema = schema.Scalar(
            (np.float32, self.input_shape),
            self.get_next_blob_reference('output')
        )

        self.momentum = momentum
        self.order = order

        self.scale = self.create_param(param_name='scale',
                                       shape=[input_dims],
                                       initializer=('ConstantFill', {'value': scale_init_value}),
                                       optimizer=scale_optim)
        self.bias = self.create_param(param_name='bias',
                                      shape=[input_dims],
                                      initializer=('ConstantFill', {'value': 0.0}),
                                      optimizer=bias_optim)
        # The running statistics are written by SpatialBN during training,
        # so they are created without an optimizer.
        self.rm = self.create_param(param_name='running_mean',
                                    shape=[input_dims],
                                    initializer=('ConstantFill', {'value': 0.0}),
                                    optimizer=model.NoOptim)
        self.riv = self.create_param(param_name='running_inv_var',
                                     shape=[input_dims],
                                     initializer=('ConstantFill', {'value': 1.0}),
                                     optimizer=model.NoOptim)

    def _add_ops(self, net, is_test, out_blob=None):
        original_input_blob = self.input_record.field_blobs()
        # Note: this reference is overwritten in both branches below.
        input_blob = net.NextScopedBlob('expand_input')
        if len(self.input_shape) == 1:
            # Expand a 2D (N, C) input to 4D by adding two singleton dims.
            input_blob = net.ExpandDims(original_input_blob,
                                        dims=[2, 3])
        else:
            input_blob = original_input_blob[0]

        if out_blob is None:
            bn_output = self.output_schema.field_blobs()
        else:
            bn_output = out_blob
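        if is_test:
            output_blobs = bn_output
        else:
            # In training mode SpatialBN also writes the running statistics
            # and the saved mean and inverse variance.
            output_blobs = bn_output + [self.rm, self.riv,
                                        net.NextScopedBlob('bn_saved_mean'),
                                        net.NextScopedBlob('bn_saved_iv')]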

        net.SpatialBN([input_blob, self.scale,
                       self.bias, self.rm, self.riv],
                      output_blobs,
                      momentum=self.momentum,
                      is_test=is_test,
                      order=self.order)

        if len(self.input_shape) == 1:
            # Undo the ExpandDims above so the output is 2D again.
            net.Squeeze(bn_output,
                        bn_output,
                        dims=[2, 3])

    def add_train_ops(self, net):
        self._add_ops(net, is_test=False)

    def add_eval_ops(self, net):
        self._add_ops(net, is_test=True)

    def add_ops(self, net):
        self.add_eval_ops(net)