stable-baselines3/docs/modules/ddpg.rst

.. _ddpg:

.. automodule:: stable_baselines3.ddpg


DDPG
====

`Deep Deterministic Policy Gradient (DDPG) <https://spinningup.openai.com/en/latest/algorithms/ddpg.html>`_ combines the
trick for DQN with the deterministic policy gradient, to obtain an algorithm for continuous actions.


.. note::

  As ``DDPG`` can be seen as a special case of its successor :ref:`TD3 <td3>`,
  they share the same policies and same implementation.


.. rubric:: Available Policies

.. autosummary::
    :nosignatures:

    MlpPolicy
    CnnPolicy


Notes
-----

- Deterministic Policy Gradient: http://proceedings.mlr.press/v32/silver14.pdf
- DDPG Paper: https://arxiv.org/abs/1509.02971
- OpenAI Spinning Guide for DDPG: https://spinningup.openai.com/en/latest/algorithms/ddpg.html


Can I use?
----------

-  Recurrent policies: ❌
-  Multi processing: ❌
-  Gym spaces:


============= ====== ===========
Space         Action Observation
============= ====== ===========
Discrete      ❌      ✔️
Box           ✔️       ✔️
MultiDiscrete ❌      ✔️
MultiBinary   ❌      ✔️
============= ====== ===========


Example
-------

.. code-block:: python

  import gym
  import numpy as np

  from stable_baselines3 import DDPG
  from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise

  env = gym.make("Pendulum-v0")

  # The noise objects for DDPG
  n_actions = env.action_space.shape[-1]
  action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))

  model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
  model.learn(total_timesteps=10000, log_interval=10)
  model.save("ddpg_pendulum")
  env = model.get_env()

  del model # remove to demonstrate saving and loading

  model = DDPG.load("ddpg_pendulum")

  obs = env.reset()
  while True:
      action, _states = model.predict(obs)
      obs, rewards, dones, info = env.step(action)
      env.render()

Results
-------

PyBullet Environments
^^^^^^^^^^^^^^^^^^^^^

Results on the PyBullet benchmark (1M steps) using 6 seeds.
The complete learning curves are available in the `associated issue #48 <https://github.com/DLR-RM/stable-baselines3/issues/48>`_.


.. note::

  Hyperparameters of :ref:`TD3 <td3>` from the `gSDE paper <https://arxiv.org/abs/2005.05719>`_ were used for ``DDPG``.


*Gaussian* means that the unstructured Gaussian noise is used for exploration,
*gSDE* (generalized State-Dependent Exploration) is used otherwise.

+--------------+--------------+--------------+--------------+
| Environments | DDPG         | TD3          | SAC          |
+==============+==============+==============+==============+
|              | Gaussian     | Gaussian     | gSDE         |
+--------------+--------------+--------------+--------------+
| HalfCheetah  | 2272 +/- 69  | 2774 +/- 35  | 2984 +/- 202 |
+--------------+--------------+--------------+--------------+
| Ant          | 1651 +/- 407 | 3305 +/- 43  | 3102 +/- 37  |
+--------------+--------------+--------------+--------------+
| Hopper       | 1201 +/- 211 | 2429 +/- 126 | 2262 +/- 1   |
+--------------+--------------+--------------+--------------+
| Walker2D     | 882 +/- 186  | 2063 +/- 185 | 2136 +/- 67  |
+--------------+--------------+--------------+--------------+


How to replicate the results?
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Clone the `rl-zoo repo <https://github.com/DLR-RM/rl-baselines3-zoo>`_:

.. code-block:: bash

  git clone https://github.com/DLR-RM/rl-baselines3-zoo
  cd rl-baselines3-zoo/


Run the benchmark (replace ``$ENV_ID`` by the envs mentioned above):

.. code-block:: bash

  python train.py --algo ddpg --env $ENV_ID --eval-episodes 10 --eval-freq 10000


Plot the results:

.. code-block:: bash

  python scripts/all_plots.py -a ddpg -e HalfCheetah Ant Hopper Walker2D -f logs/ -o logs/ddpg_results
  python scripts/plot_from_file.py -i logs/ddpg_results.pkl -latex -l DDPG


Parameters
----------

.. autoclass:: DDPG
  :members:
  :inherited-members:

.. _ddpg_policies:

DDPG Policies
-------------

.. autoclass:: MlpPolicy
  :members:
  :inherited-members:

.. autoclass:: stable_baselines3.td3.policies.TD3Policy
  :members:
  :noindex:

.. autoclass:: CnnPolicy
  :members:
  :noindex:
Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00			`.. _ddpg:`

			`.. automodule:: stable_baselines3.ddpg`


			`DDPG`
			`====`

			`Deep Deterministic Policy Gradient (DDPG) <https://spinningup.openai.com/en/latest/algorithms/ddpg.html>`_ combines the
			`trick for DQN with the deterministic policy gradient, to obtain an algorithm for continuous actions.`


Update documentation (#199) * Update doc and add new example * Add save/load replay buffer example * Add save format + export doc * Add example for get/set parameters * Typos and minor edits * Add results sections * Add note about performance * Add DDPG results * Address comments * Fix grammar/wording Co-authored-by: Anssi "Miffyli" Kanervisto <kaneran21@hotmail.com> 2020-10-28 08:55:16 +00:00			`.. note::`

			As ``DDPG`` can be seen as a special case of its successor :ref:`TD3 <td3>`,
			`they share the same policies and same implementation.`


Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00			`.. rubric:: Available Policies`

			`.. autosummary::`
			`:nosignatures:`

			`MlpPolicy`
Update documentation (#199) * Update doc and add new example * Add save/load replay buffer example * Add save format + export doc * Add example for get/set parameters * Typos and minor edits * Add results sections * Add note about performance * Add DDPG results * Address comments * Fix grammar/wording Co-authored-by: Anssi "Miffyli" Kanervisto <kaneran21@hotmail.com> 2020-10-28 08:55:16 +00:00			`CnnPolicy`
Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00

			`Notes`
			`-----`

			`- Deterministic Policy Gradient: http://proceedings.mlr.press/v32/silver14.pdf`
			`- DDPG Paper: https://arxiv.org/abs/1509.02971`
			`- OpenAI Spinning Guide for DDPG: https://spinningup.openai.com/en/latest/algorithms/ddpg.html`



			`Can I use?`
			`----------`

			`- Recurrent policies: ❌`
			`- Multi processing: ❌`
			`- Gym spaces:`


			`============= ====== ===========`
			`Space Action Observation`
			`============= ====== ===========`
			`Discrete ❌ ✔️`
			`Box ✔️ ✔️`
			`MultiDiscrete ❌ ✔️`
			`MultiBinary ❌ ✔️`
			`============= ====== ===========`


			`Example`
			`-------`

			`.. code-block:: python`

			`import gym`
			`import numpy as np`

			`from stable_baselines3 import DDPG`
			`from stable_baselines3.common.noise import NormalActionNoise, OrnsteinUhlenbeckActionNoise`

Add custom objects support + bug fix (#336) * Add support for custom objects * Add python 3.8 to the CI * Bump version * PyType fixes * [ci skip] Fix typo * Add note about slow-down + fix typos * Minor edits to the doc * Bug fix for DQN * Update test * Add test for custom objects 2021-03-06 13:17:43 +00:00			`env = gym.make("Pendulum-v0")`
Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00
			`# The noise objects for DDPG`
			`n_actions = env.action_space.shape[-1]`
			`action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))`

Add custom objects support + bug fix (#336) * Add support for custom objects * Add python 3.8 to the CI * Bump version * PyType fixes * [ci skip] Fix typo * Add note about slow-down + fix typos * Minor edits to the doc * Bug fix for DQN * Update test * Add test for custom objects 2021-03-06 13:17:43 +00:00			`model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)`
Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00			`model.learn(total_timesteps=10000, log_interval=10)`
			`model.save("ddpg_pendulum")`
			`env = model.get_env()`

			`del model # remove to demonstrate saving and loading`

			`model = DDPG.load("ddpg_pendulum")`

			`obs = env.reset()`
			`while True:`
			`action, _states = model.predict(obs)`
			`obs, rewards, dones, info = env.step(action)`
			`env.render()`

Update documentation (#199) * Update doc and add new example * Add save/load replay buffer example * Add save format + export doc * Add example for get/set parameters * Typos and minor edits * Add results sections * Add note about performance * Add DDPG results * Address comments * Fix grammar/wording Co-authored-by: Anssi "Miffyli" Kanervisto <kaneran21@hotmail.com> 2020-10-28 08:55:16 +00:00			`Results`
			`-------`

			`PyBullet Environments`
			`^^^^^^^^^^^^^^^^^^^^^`

			`Results on the PyBullet benchmark (1M steps) using 6 seeds.`
			The complete learning curves are available in the `associated issue #48 <https://github.com/DLR-RM/stable-baselines3/issues/48>`_.


			`.. note::`

			Hyperparameters of :ref:`TD3 <td3>` from the `gSDE paper <https://arxiv.org/abs/2005.05719>`_ were used for ``DDPG``.


			`Gaussian means that the unstructured Gaussian noise is used for exploration,`
			`gSDE (generalized State-Dependent Exploration) is used otherwise.`

			`+--------------+--------------+--------------+--------------+`
			`\| Environments \| DDPG \| TD3 \| SAC \|`
			`+==============+==============+==============+==============+`
			`\| \| Gaussian \| Gaussian \| gSDE \|`
			`+--------------+--------------+--------------+--------------+`
			`\| HalfCheetah \| 2272 +/- 69 \| 2774 +/- 35 \| 2984 +/- 202 \|`
			`+--------------+--------------+--------------+--------------+`
			`\| Ant \| 1651 +/- 407 \| 3305 +/- 43 \| 3102 +/- 37 \|`
			`+--------------+--------------+--------------+--------------+`
			`\| Hopper \| 1201 +/- 211 \| 2429 +/- 126 \| 2262 +/- 1 \|`
			`+--------------+--------------+--------------+--------------+`
			`\| Walker2D \| 882 +/- 186 \| 2063 +/- 185 \| 2136 +/- 67 \|`
			`+--------------+--------------+--------------+--------------+`



			`How to replicate the results?`
			`^^^^^^^^^^^^^^^^^^^^^^^^^^^^^`

			Clone the `rl-zoo repo <https://github.com/DLR-RM/rl-baselines3-zoo>`_:

			`.. code-block:: bash`

			`git clone https://github.com/DLR-RM/rl-baselines3-zoo`
			`cd rl-baselines3-zoo/`


			Run the benchmark (replace ``$ENV_ID`` by the envs mentioned above):

			`.. code-block:: bash`

			`python train.py --algo ddpg --env $ENV_ID --eval-episodes 10 --eval-freq 10000`


			`Plot the results:`

			`.. code-block:: bash`

			`python scripts/all_plots.py -a ddpg -e HalfCheetah Ant Hopper Walker2D -f logs/ -o logs/ddpg_results`
			`python scripts/plot_from_file.py -i logs/ddpg_results.pkl -latex -l DDPG`


Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00
			`Parameters`
			`----------`

			`.. autoclass:: DDPG`
			`:members:`
			`:inherited-members:`

			`.. _ddpg_policies:`

			`DDPG Policies`
			`-------------`

			`.. autoclass:: MlpPolicy`
			`:members:`
			`:inherited-members:`

Implement HER (#120) * Added working her version, Online sampling is missing. * Updated test_her. * Added first version of online her sampling. Still problems with tensor dimensions. * Reformat * Fixed tests * Added some comments. * Updated changelog. * Add missing init file * Fixed some small bugs. * Reduced arguments for HER, small changes. * Added getattr. Fixed bug for online sampling. * Updated save/load funtions. Small changes. * Added her to init. * Updated save method. * Updated her ratio. * Move obs_wrapper * Added DQN test. * Fix potential bug * Offline and online her share same sample_goal function. * Changed lists into arrays. * Updated her test. * Fix online sampling * Fixed action bug. Updated time limit for episodes. * Updated convert_dict method to take keys as arguments. * Renamed obs dict wrapper. * Seed bit flipping env * Remove get_episode_dict * Add fast online sampling version * Added documentation. * Vectorized reward computation * Vectorized goal sampling * Update time limit for episodes in online her sampling. * Fix max episode length inference * Bug fix for Fetch envs * Fix for HER + gSDE * Reformat (new black version) * Added info dict to compute new reward. Check her_replay_buffer again. * Fix info buffer * Updated done flag. * Fixes for gSDE * Offline her version uses now HerReplayBuffer as episode storage. * Fix num_timesteps computation * Fix get torch params * Vectorized version for offline sampling. * Modified offline her sampling to use sample method of her_replay_buffer * Updated HER tests. * Updated documentation * Cleanup docstrings * Updated to review comments * Fix pytype * Update according to review comments. * Removed random goal strategy. Updated sample transitions. * Updated migration. Removed time signal removal. * Update doc * Fix potential load issue * Add VecNormalize support for dict obs * Updated saving/loading replay buffer for HER. * Fix test memory usage * Fixed save/load replay buffer. * Fixed save/load replay buffer * Fixed transition index after loading replay buffer in online sampling * Better error handling * Add tests for get_time_limit * More tests for VecNormalize with dict obs * Update doc * Improve HER description * Add test for sde support * Add comments * Add comments * Remove check that was always valid * Fix for terminal observation * Updated buffer size in offline version and reset of HER buffer * Reformat * Update doc * Remove np.empty + add doc * Fix loading * Updated loading replay buffer * Separate online and offline sampling + bug fixes * Update tensorboard log name * Version bump * Bug fix for special case Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de> Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org> 2020-10-22 09:56:43 +00:00			`.. autoclass:: stable_baselines3.td3.policies.TD3Policy`
			`:members:`
			`:noindex:`
Implement DDPG (#92) * Add DDPG + TD3 with any number of critics * Allow any number of critics for SAC * Update doc * [ci skip] Update DDPG example * Remove unused parameter * Add DDPG to identity test * Fix computation with n_critics=1,3 * Update doc * Apply suggestions from code review Co-authored-by: Adam Gleave <adam@gleave.me> * Update docstrings for off-policy algos * Add check for sde Co-authored-by: Adam Gleave <adam@gleave.me> 2020-07-16 12:14:22 +00:00
Implement HER (#120) * Added working her version, Online sampling is missing. * Updated test_her. * Added first version of online her sampling. Still problems with tensor dimensions. * Reformat * Fixed tests * Added some comments. * Updated changelog. * Add missing init file * Fixed some small bugs. * Reduced arguments for HER, small changes. * Added getattr. Fixed bug for online sampling. * Updated save/load funtions. Small changes. * Added her to init. * Updated save method. * Updated her ratio. * Move obs_wrapper * Added DQN test. * Fix potential bug * Offline and online her share same sample_goal function. * Changed lists into arrays. * Updated her test. * Fix online sampling * Fixed action bug. Updated time limit for episodes. * Updated convert_dict method to take keys as arguments. * Renamed obs dict wrapper. * Seed bit flipping env * Remove get_episode_dict * Add fast online sampling version * Added documentation. * Vectorized reward computation * Vectorized goal sampling * Update time limit for episodes in online her sampling. * Fix max episode length inference * Bug fix for Fetch envs * Fix for HER + gSDE * Reformat (new black version) * Added info dict to compute new reward. Check her_replay_buffer again. * Fix info buffer * Updated done flag. * Fixes for gSDE * Offline her version uses now HerReplayBuffer as episode storage. * Fix num_timesteps computation * Fix get torch params * Vectorized version for offline sampling. * Modified offline her sampling to use sample method of her_replay_buffer * Updated HER tests. * Updated documentation * Cleanup docstrings * Updated to review comments * Fix pytype * Update according to review comments. * Removed random goal strategy. Updated sample transitions. * Updated migration. Removed time signal removal. * Update doc * Fix potential load issue * Add VecNormalize support for dict obs * Updated saving/loading replay buffer for HER. * Fix test memory usage * Fixed save/load replay buffer. * Fixed save/load replay buffer * Fixed transition index after loading replay buffer in online sampling * Better error handling * Add tests for get_time_limit * More tests for VecNormalize with dict obs * Update doc * Improve HER description * Add test for sde support * Add comments * Add comments * Remove check that was always valid * Fix for terminal observation * Updated buffer size in offline version and reset of HER buffer * Reformat * Update doc * Remove np.empty + add doc * Fix loading * Updated loading replay buffer * Separate online and offline sampling + bug fixes * Update tensorboard log name * Version bump * Bug fix for special case Co-authored-by: Antonin Raffin <antonin.raffin@dlr.de> Co-authored-by: Antonin RAFFIN <antonin.raffin@ensta.org> 2020-10-22 09:56:43 +00:00			`.. autoclass:: CnnPolicy`
			`:members:`
Update PPO KL Divergence Estimator (#419) * remove unused all_kl_divs memory * new kl approximate equation * move kl check before update step * update changelog * add continue_training flag update to kl check * add verbose check * update changelog * lint with black * r -> log_ratio * Add link to PR * invert ratio * Fix for Sphinx v4.0 Co-authored-by: Anssi <kaneran21@hotmail.com> Co-authored-by: Antonin Raffin <antonin.raffin@ensta.org> 2021-05-10 10:21:00 +00:00			`:noindex:`