Revert D26249330: [Gradient Compression] Add a documentation page for DDP communication hooks

Test Plan: revert-hammer Differential Revision: D26249330 (e62aabac43) Original commit changeset: ab973390ddb7 fbshipit-source-id: d508daed76219e7ca588cf7fb38aeaaffc61acfd
2026-05-15 21:00:47 +00:00 · 2021-02-04 22:35:37 -08:00 · 2021-02-04 22:35:37 -08:00 · d3023d86ba
commit d3023d86ba
parent 1065c2d5b6
4 changed files with 62 additions and 153 deletions
--- a/docs/source/ddp_comm_hooks.rst
+++ b/docs/source/ddp_comm_hooks.rst
@ -1,68 +0,0 @@
-DDP Communication Hooks
-=======================
-
-DDP communication hook is a generic interface to control how to communicate
-gradients across workers by overriding the vanilla allreduce in
-`DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#torch.nn.parallel.DistributedDataParallel.>`_.
-A few built-in communication hooks are provided,
-and users can easily apply any of these hooks to optimize communication.
-Besides, the hook interface can also support user-defined communication
-strategies for more advanced use cases.
-
-.. warning ::
-    DDP communication hook is experimental and subject to change.
-
-.. warning ::
-    DDP communication hooks can only support single process single device mode
-    on NCCL backend.
-
-How to Use A Communication Hook?
--------------------------------
-
-To use a communication hook, the user just needs to let the DDP model register
-the hook before the training loop.
-
-.. automethod:: torch.nn.parallel.DistributedDataParallel.register_comm_hook
-
-Default Communication Hooks
---------------------------
-
-Default communication hooks are simple **stateless** hooks, so the input state
-in ``register_comm_hook`` is either a process group or ``None``.
-
-.. automodule:: torch.distributed.algorithms.ddp_comm_hooks.default_hooks
-    :members:
-
-PowerSGD Communication Hook
---------------------------
-
-PowerSGD communication hook is a **stateful** hook used for gradient
-compression, and the user needs to provide a state defined as below.
-The performance is `on par with <https://observablehq.com/@tvogels/powersgd-benchmark>`_
-the implementation in the original `paper <https://arxiv.org/abs/1905.13727>`_.
-
-PowerSGD State
-^^^^^^^^^^^^^^^^
-
-.. currentmodule:: torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook
-.. autoclass:: PowerSGDState
-
-PowerSGD Hooks
-^^^^^^^^^^^^^^^^
-
-.. warning ::
-    PowerSGD requires an extra copy of gradients for error feedback,
-    which may be infeasible for use cases that have a memory constraint.
-
-.. warning ::
-    The current implementation may cause gradient overflow for FP16 input.
-
-.. autofunction:: powerSGD_hook
-.. autofunction:: batched_powerSGD_hook
-
-Acknowledgements
----------------
-
-Thanks PowerSGD paper author Thijs Vogels for the code review on PowerSGD
-communication hook and the
-`comparison experiments <https://observablehq.com/@tvogels/powersgd-benchmark>`_.
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -71,7 +71,6 @@ Features described in this documentation are classified by release status:
   onnx
   optim
   complex_numbers
-   ddp_comm_hooks
   pipeline
   quantization
   rpc
--- a/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py
+++ b/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py
@ -33,24 +33,22 @@ def _orthogonalize(matrix, epsilon=1e-8):


 class PowerSGDState(object):
-    r"""
+    """
    Stores both the gradient compression configs and the internal states for all the gradients during the training.
-    Particularly, ``matrix_approximation_rank`` and ``start_powerSGD_iter`` are the main configs that need to be tuned by the user.
-    Although ``use_error_feedback`` and ``warm_start`` can also be tuned by the user,
+    Particularly, `matrix_approximation_rank` and `start_powerSGD_iter` are the main configs that need to be tuned by the user.
+    Although `use_error_feedback` and `warm_start` can also be tuned by the user,
    they are typically turned on for performance.

-    Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``]
+    Note [Guidance to Tune `matrix_approximation_rank` And `start_powerSGD_iter`]
    ~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-    1. To tune ``matrix_approximation_rank``, the user can increase it from 1 by factors of 2,
+    1) To tune `matrix_approximation_rank`, the user can increase it from 1 by factors of 2,
    until a satisfying accuracy can be reached.
-    The increase of ``matrix_approximation_rank`` can substantially increase the computation costs of the compression.
-    However, the accuracy may not be futher improved beyond a certain ``matrix_approximation_rank`` value.
-
-    2. To tune ``start_powerSGD_iter``, the user can typically start with 10% of total training steps,
+    The increase of `matrix_approximation_rank` can substantially increase the computation costs of the compression.
+    However, the accuracy may not be futher improved beyond a certain `matrix_approximation_rank` value.
+    2) To tune `start_powerSGD_iter`, the user can typically start with 10% of total training steps,
    and increase it until a satisfying accuracy can be reached.
    Deferrring PowerSGD can effectively improve the accuracy,
-    even a relatively small ``matrix_approximation_rank`` is used.
+    even a relatively small `matrix_approximation_rank` is used.
    This is because that, the beginning of training phase is usually very sensitive to inaccurate gradients,
    and compressing gradients too early may make the training quickly take a suboptimal trajectory,
    which can result in an irrecoverable impact on the accuracy.
@ -164,44 +162,38 @@ class PowerSGDState(object):


 def powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
-    r"""
+    """
    This DDP communication hook implements the original PowerSGD gradient compression
    algorithm described in https://arxiv.org/abs/1905.13727.
    Once gradient tensors are aggregated across all workers, this hook applies
    compression as follows:
-
-    1. Views the input flattened 1D gradient tensor as two groups of per-parameter tensors: high-rank tensors and vector-like rank-1 tensors (for biases).
-
-    2. Handles rank-1 tensors by allreducing them without compression:
-
-        2.1. Allocate contiguous memory for those rank-1 tensors, and allreduces all the rank-1 tensors as a batch, without compression;
-
-        2.2. Copies the individual rank-1 tensors from the contiguous memory back to the input tensor.
-
-    3. Handles high-rank tensors by PowerSGD compression:
-
-        3.1. For each high-rank tensor M, creates two low-rank tensors P and Q for decomposing M,
+    1) Views the input flattened 1D gradient tensor as two groups of per-parameter tensors:
+    high-rank tensors and vector-like rank-1 tensors (for biases).
+    2) Handles rank-1 tensors by allreducing them without compression:
+        2.1) Allocate contiguous memory for those rank-1 tensors,
+        and allreduces all the rank-1 tensors as a batch, without compression;
+        2.2) Copies the individual rank-1 tensors from the contiguous memory back to the input tensor.
+    3) Handles high-rank tensors by PowerSGD compression:
+        3.1) For each high-rank tensor M, creates two low-rank tensors P and Q for decomposing M,
        such that M = PQ^T, where Q is initialized from a standard normal distribution and orthogonalized;
+        3.2) Computes each P in Ps, which is equal to MQ;
+        3.3) Allreduces Ps as a batch;
+        3.4) Orthogonalizes each P in Ps;
+        3.5) Computes each Q in Qs, which is approximately equal to M^TP;
+        3.6) Allreduces Qs as a batch;
+        3.7) Computes each M among all the high-rank tensors, which is approximately equal to PQ^T.

-        3.2. Computes each P in Ps, which is equal to MQ;
-
-        3.3. Allreduces Ps as a batch;
-
-        3.4. Orthogonalizes each P in Ps;
-
-        3.5. Computes each Q in Qs, which is approximately equal to M^TP;
-
-        3.6. Allreduces Qs as a batch;
-
-        3.7. Computes each M among all the high-rank tensors, which is approximately equal to PQ^T.
-
-    Note that this communication hook enforces vanilla allreduce for the first ``state.start_powerSGD_iter`` iterations.
+    Note that this communication hook enforces vanilla allreduce for the first `state.start_powerSGD_iter` iterations.
    This can not only allow the user to have a finer tuning over the tradeoff between speedup and accuracy,
    but also help abstract away some complexity of the internal optimization of DDP for future communication hook developers.

+    TODO(wayi@): The above procedure does two matmul+allreduce steps per iteration --
+    one left multiplication and one right multiplication.
+    For warm-start, can take one such step at a time, and alternate between them.
+
    Args:
        state (PowerSGDState): State information to configure the compression rate and support error feedback, warm start, etc.
-            To tune the compression configs, see Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``].
+            To tune the compression configs, see Note [Guidance to Tune `matrix_approximation_rank` And `start_powerSGD_iter`].
        bucket (dist._GradBucket): Bucket that stores a 1D flattened gradient tensor that batches multiple per-variable tensors.
            Note that since DDP comm hook only supports single process single device mode at this time,
            only exactly one tensor is stored in this bucket.
@ -210,9 +202,9 @@ def powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
        Future handler of the communication, which updates the gradients in place.

    Example::
-        >>> state = PowerSGDState(process_group=process_group, matrix_approximation_rank=1, start_powerSGD_iter=10)
+        state = PowerSGDState(process_group=process_group, matrix_approximation_rank=1, start_powerSGD_iter=10)
        >>> ddp_model.register_comm_hook(state, powerSGD_hook)
-    """  # noqa
+    """
    process_group = state.process_group
    group_to_use = process_group if process_group is not None else dist.group.WORLD
    world_size = group_to_use.size()
@ -382,10 +374,6 @@ def powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
        for tensor, p, q in zip(high_rank_tensors, ps, qs):
            torch.matmul(tensor.t(), p, out=q)

-        # TODO: The above procedure does two matmul+allreduce steps per iteration --
-        # one left multiplication and one right multiplication.
-        # For warm-start, can take one such step at a time, and alternate between them.
-
        # Allreduce Qs.
        return [
            dist.all_reduce(
@ -424,44 +412,40 @@ def powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:


 def batched_powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
-    r"""
+    """
    This DDP communication hook implements a simplified PowerSGD gradient compression
    algorithm described in https://arxiv.org/abs/1905.13727.
    Once gradient tensors are aggregated across all workers, this hook applies
    compression to the flattened input tensor that batches per-parameter tensors as follows:
+    1) Views the input flattened 1D gradient tensor as a square-shaped tensor M with 0 paddings;
+    2) Creates two low-rank tensors P and Q for decomposing M,
+    such that M = PQ^T, where Q is initialized from a standard normal distribution and orthogonalized;
+    2) Computes P, which is equal to MQ;
+    3) Allreduces P;
+    4) Orthogonalizes P;
+    5) Computes Q, which is approximately equal to M^TP;
+    6) Allreduces Q;
+    7) Computes M, which is approximately equal to PQ^T.
+    8) Truncates the input tensor to the original length.

-    1. Views the input flattened 1D gradient tensor as a square-shaped tensor M with 0 paddings;
-
-    2. Creates two low-rank tensors P and Q for decomposing M, such that M = PQ^T, where Q is initialized from a standard normal distribution and orthogonalized;
-
-    3. Computes P, which is equal to MQ;
-
-    4. Allreduces P;
-
-    5. Orthogonalizes P;
-
-    6. Computes Q, which is approximately equal to M^TP;
-
-    7. Allreduces Q;
-
-    8. Computes M, which is approximately equal to PQ^T.
-
-    9. Truncates the input tensor to the original length.
-
-    This variant is faster than :meth:`powerSGD_hook` that runs layer-wise gradient compression,
-    but it usually results in a much lower accuracy, unless ``matrix_approximation_rank`` in the state is 1.
-    Increasing ``matrix_approximation_rank`` may not necessarily increase the accuracy,
+    This variant is faster than `powerSGD_hook` that runs layer-wise gradient compression,
+    but it usually results in a much lower accuracy, unless `matrix_approximation_rank` in the state is 1.
+    Increasing `matrix_approximation_rank` may not necessarily increase the accuracy,
    because batching per-parameter tensors without column/row alignment can destroy low-rank structure.
-    Therefore, the user should always consider :meth:`powerSGD_hook` first,
-    and only consider this variant when a satisfying accuracy can be achieved when ``matrix_approximation_rank`` is 1.
+    Therefore, the user shoud always consider `powerSGD_hook` first,
+    and only consider this variant when a satisfying accuracy can be achieved when `matrix_approximation_rank` is 1.

-    Note that this communication hook enforces vanilla allreduce for the first ``state.start_powerSGD_iter`` iterations.
+    Note that this communication hook enforces vanilla allreduce for the first `state.start_powerSGD_iter` iterations.
    This can not only allow the user to have a finer tuning over the tradeoff between speedup and accuracy,
    but also help abstract away some complexity of the internal optimization of DDP for future communication hook developers.

+    TODO(wayi@): The above procedure does two matmul+allreduce steps per iteration --
+    one left multiplication and one right multiplication.
+    For warm-start, can take one such step at a time, and alternate between them.
+
    Args:
        state (PowerSGDState): State information to configure the compression rate and support error feedback, warm start, etc.
-            To tune the compression configs, see Note [Guidance to Tune ``matrix_approximation_rank`` And ``start_powerSGD_iter``].
+            To tune the compression configs, see Note [Guidance to Tune `matrix_approximation_rank` And `start_powerSGD_iter`].
        bucket (dist._GradBucket): Bucket that stores a 1D flattened gradient tensor that batches multiple per-variable tensors.
            Note that since DDP comm hook only supports single process single device mode at this time,
            only exactly one tensor is stored in this bucket.
@ -470,9 +454,9 @@ def batched_powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
        Future handler of the communication, which updates the gradients in place.

    Example::
-        >>> state = PowerSGDState(process_group=process_group, matrix_approximation_rank=1)
+        state = PowerSGDState(process_group=process_group, matrix_approximation_rank=1)
        >>> ddp_model.register_comm_hook(state, batched_powerSGD_hook)
-    """  # noqa
+    """
    process_group = state.process_group
    group_to_use = process_group if process_group is not None else dist.group.WORLD
    world_size = group_to_use.size()
@ -579,11 +563,6 @@ def batched_powerSGD_hook(state: PowerSGDState, bucket) -> torch.futures.Future:
            out=state.q_memory_dict[bucket_index],
        )

-
-        # TODO: The above procedure does two matmul+allreduce steps per iteration --
-        # one left multiplication and one right multiplication.
-        # For warm-start, can take one such step at a time, and alternate between them.
-
        return [
            dist.all_reduce(
                state.q_memory_dict[bucket_index], group=group_to_use, async_op=True
--- a/torch/nn/parallel/distributed.py
+++ b/torch/nn/parallel/distributed.py
@ -1021,13 +1021,12 @@ class DistributedDataParallel(Module):
        parameter syncs while running Distributed DataParallel training.

        Args:
-            state (object): Passed to the hook to maintain any state information during the training process.
-                            Examples include error feedback in gradient compression,
-                            peers to communicate with next in GossipGrad, etc.
-
-                            It is locally stored by each worker
-                            and shared by all the gradient tensors on the worker.
-            hook (callable): Averages gradient tensors across workers and defined as:
+            state (object): state is passed to the hook and can be used to maintain
+                            and update any state information that users would like to
+                            maintain as part of the training process. Examples: error
+                            feedback in gradient compression, peers to communicate with
+                            next in GossipGrad etc.
+            hook (callable): is defined as:
                             hook(state: object, bucket: dist._GradBucket) -> torch.futures.Future:

                             This function is called once the bucket is ready. The