.. _developer:
================
Developer Guide
================
This guide is meant for those who want to understand the internals and the design choices of Stable-Baselines3.
First, you should read the two issues where the design choices were discussed:

- https://github.com/hill-a/stable-baselines/issues/576
- https://github.com/hill-a/stable-baselines/issues/733
The library is not meant to be modular, although inheritance is used to reduce code duplication.
Algorithms Structure
====================
Each algorithm (on-policy and off-policy ones) follows a common structure.
The policy contains the code for acting in the environment, and the algorithm updates this policy.
There is one folder per algorithm, and in that folder you will find the algorithm and the policy definition (``policies.py``).
Each algorithm has two main methods:

- ``.collect_rollouts()``, which defines how new samples are collected (usually inherited from the base class). Those samples are then stored in a ``RolloutBuffer`` (discarded after the gradient update) or in a ``ReplayBuffer``.
- ``.train()``, which updates the parameters using samples from the buffer.

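As an illustration only (this is a simplified mock, not SB3's actual implementation), the interaction between ``.collect_rollouts()`` and ``.train()`` can be sketched as a minimal training loop:

```python
# Illustrative sketch of the collect/train split. All class and method names
# are stand-ins; the real SB3 classes step a VecEnv and use torch tensors.
import random


class SketchReplayBuffer:
    """Stores transitions; off-policy algorithms keep them across updates."""

    def __init__(self):
        self.storage = []

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


class SketchAlgorithm:
    def __init__(self):
        self.buffer = SketchReplayBuffer()
        self.n_updates = 0

    def collect_rollouts(self, n_steps):
        # In SB3 this would act in the environment with the current policy;
        # here we just store dummy transitions.
        for step in range(n_steps):
            self.buffer.add({"obs": step, "action": 0, "reward": 1.0})

    def train(self, batch_size=4):
        batch = self.buffer.sample(batch_size)
        # A real algorithm would compute a loss and take a gradient step here.
        self.n_updates += 1
        return len(batch)

    def learn(self, total_steps, rollout_len=8):
        # Alternate between collecting experience and updating the policy.
        for _ in range(total_steps // rollout_len):
            self.collect_rollouts(rollout_len)
            self.train()


algo = SketchAlgorithm()
algo.learn(total_steps=32)
print(algo.n_updates)  # 4 gradient updates for 32 collected steps
```
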
.. image:: ../_static/img/sb3_loop.png

Where to start?
===============
The first things you need to read and understand are the base classes in the ``common/`` folder:

- ``BaseAlgorithm`` in ``base_class.py``, which defines how an RL algorithm class should look.
  It also contains all the "glue code" for saving/loading and the common operations (e.g. wrapping environments).

- ``BasePolicy`` in ``policies.py``, which defines how a policy class should look.
  It also contains all the magic for the ``.predict()`` method, to handle as many spaces/cases as possible.

- ``OffPolicyAlgorithm`` in ``off_policy_algorithm.py``, which contains the implementation of ``collect_rollouts()`` for the off-policy algorithms,
  and similarly ``OnPolicyAlgorithm`` in ``on_policy_algorithm.py``.

All the environments handled internally are assumed to be ``VecEnv`` (``gym.Env`` environments are automatically wrapped).
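The idea behind this auto-wrapping can be sketched as follows (the classes below are simplified stand-ins, not SB3's real ``VecEnv``/``DummyVecEnv``): a single environment gets wrapped so the rest of the code can always assume the vectorized, batched interface.

```python
# Illustrative sketch of auto-wrapping a single env into a vectorized one.
class SingleEnv:
    def reset(self):
        return 0.0  # a single (scalar) observation


class SketchVecEnv:
    """Wraps a list of envs and returns batched observations."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]


def maybe_wrap(env):
    # If the env is not already vectorized, wrap it in a one-env VecEnv,
    # so downstream code only ever deals with batches of observations.
    if not isinstance(env, SketchVecEnv):
        env = SketchVecEnv([env])
    return env


vec_env = maybe_wrap(SingleEnv())
print(vec_env.reset())  # [0.0] -- observations are always batched
```
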
Pre-Processing
==============
To handle different observation spaces, some pre-processing needs to be done (e.g. one-hot encoding for discrete observations).
Most of the code for pre-processing is in ``common/preprocessing.py`` and ``common/policies.py``.

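For instance, the one-hot encoding of a discrete observation can be sketched like this (SB3 does this with PyTorch tensors; plain lists are used here for clarity):

```python
# Illustrative sketch of one-hot pre-processing for a discrete observation.
def one_hot(obs: int, n: int) -> list:
    """Encode a discrete observation in [0, n) as a one-hot vector."""
    vec = [0.0] * n
    vec[obs] = 1.0
    return vec


print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```
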
For images, the environment is automatically wrapped with ``VecTransposeImage`` if observations are detected to be images with
the channel-last convention, in order to transform them to PyTorch's channel-first convention.

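The transposition itself boils down to reordering the axes from (height, width, channels) to (channels, height, width). A minimal sketch with nested lists (the real wrapper operates on NumPy arrays):

```python
# Illustrative sketch of the channel-last -> channel-first transposition.
def hwc_to_chw(image):
    """Transpose a nested-list image from (H, W, C) to (C, H, W)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A 1x2 "image" with 3 channels (e.g. RGB values per pixel).
img = [[[1, 2, 3], [4, 5, 6]]]
print(hwc_to_chw(img))  # [[[1, 4]], [[2, 5]], [[3, 6]]]
```
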
Policy Structure
================
When we refer to "policy" in Stable-Baselines3, this is usually an abuse of language compared to RL terminology.
In SB3, "policy" refers to the class that handles all the networks useful for training,
so not only the network used to predict actions (the "learned controller").
For instance, the ``TD3`` policy contains the actor, the critic and the target networks.
To avoid the hassle of importing specific policy classes for a specific algorithm (e.g. both A2C and PPO use ``ActorCriticPolicy``),
SB3 uses names like "MlpPolicy" and "CnnPolicy" to refer to policies using small feed-forward networks or convolutional networks,
respectively. Importing ``[algorithm]/policies.py`` registers an appropriate policy for that algorithm under those names.
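The registration idea can be sketched as a simple name-to-class mapping (names and functions below are made up for illustration; this is not SB3's actual registration code):

```python
# Illustrative sketch of registering policies under string names.
policy_registry = {}


def register_policy(name, policy_class):
    """Map a user-facing string name to a concrete policy class."""
    policy_registry[name] = policy_class


class SketchMlpPolicy:
    pass


# Importing an algorithm's policies.py would run a registration like this:
register_policy("MlpPolicy", SketchMlpPolicy)

# The algorithm can then resolve the string the user passed in:
policy_class = policy_registry["MlpPolicy"]
print(policy_class.__name__)  # SketchMlpPolicy
```
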
Probability distributions
=========================
When needed, the policies handle the different probability distributions.
All distributions are located in ``common/distributions.py`` and follow the same interface.
Each distribution corresponds to a type of action space (e.g. ``Categorical`` is the one used for discrete actions).
For continuous actions, we can use multiple distributions ("DiagGaussian", "SquashedGaussian" or "StateDependentDistribution").
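As a sketch of what "same interface" means, here is a toy categorical distribution exposing ``sample()`` and ``log_prob()`` (SB3's versions are built on ``torch.distributions``; the class below is a stand-in):

```python
# Illustrative sketch of a distribution for discrete actions,
# following a common sample()/log_prob() interface.
import math
import random


class SketchCategorical:
    """Categorical distribution over len(probs) discrete actions."""

    def __init__(self, probs):
        self.probs = probs

    def sample(self):
        # Draw one action index according to the probabilities.
        return random.choices(range(len(self.probs)), weights=self.probs)[0]

    def log_prob(self, action):
        return math.log(self.probs[action])


dist = SketchCategorical([0.1, 0.7, 0.2])
action = dist.sample()
print(action in (0, 1, 2))         # True
print(round(dist.log_prob(1), 4))  # -0.3567 (log of 0.7)
```
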
State-Dependent Exploration
===========================
State-Dependent Exploration (SDE) is a type of exploration that allows using RL directly on real robots;
it was the starting point for the Stable-Baselines3 library.
I (@araffin) published a paper about a generalized version of SDE (the one implemented in SB3): https://arxiv.org/abs/2005.05719
Misc
====
The rest of ``common/`` is composed of helpers (e.g. evaluation helpers) or basic components (like the callbacks).
The ``type_aliases.py`` file contains common type hint aliases like ``GymStepReturn``.

Et voilà?
=========
After reading this guide and the mentioned files, you should now be able to understand the design logic behind the library ;)