.. _developer:
================
Developer Guide
================
This guide is meant for those who want to understand the internals and the design choices of Stable-Baselines3.
First, you should read the two issues where the design choices were discussed:

- https://github.com/hill-a/stable-baselines/issues/576
- https://github.com/hill-a/stable-baselines/issues/733
The library is not meant to be modular, although inheritance is used to reduce code duplication.
Algorithms Structure
====================
Each algorithm (on-policy and off-policy ones) follows a common structure.
The policy contains the code for acting in the environment, and the algorithm updates this policy.
There is one folder per algorithm, and in that folder you will find the algorithm and the policy definition (``policies.py``).
Each algorithm has two main methods:

- ``.collect_rollouts()``, which defines how new samples are collected (usually inherited from the base class). Those samples are then stored in a ``RolloutBuffer`` (discarded after the gradient update) or in a ``ReplayBuffer``.
- ``.train()``, which updates the parameters using samples from the buffer.

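As an illustration only (this is a simplified mock, not SB3's actual implementation), the interaction between ``.collect_rollouts()`` and ``.train()`` can be sketched as a minimal training loop:

```python
# Illustrative sketch of the collect/train split. All class and method names
# are stand-ins; the real SB3 classes step a VecEnv and use torch tensors.
import random


class SketchReplayBuffer:
    """Stores transitions; off-policy algorithms keep them across updates."""

    def __init__(self):
        self.storage = []

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))


class SketchAlgorithm:
    def __init__(self):
        self.buffer = SketchReplayBuffer()
        self.n_updates = 0

    def collect_rollouts(self, n_steps):
        # In SB3 this would act in the environment with the current policy;
        # here we just store dummy transitions.
        for step in range(n_steps):
            self.buffer.add({"obs": step, "action": 0, "reward": 1.0})

    def train(self, batch_size=4):
        batch = self.buffer.sample(batch_size)
        # A real algorithm would compute a loss and take a gradient step here.
        self.n_updates += 1
        return len(batch)

    def learn(self, total_steps, rollout_len=8):
        # Alternate between collecting experience and updating the policy.
        for _ in range(total_steps // rollout_len):
            self.collect_rollouts(rollout_len)
            self.train()


algo = SketchAlgorithm()
algo.learn(total_steps=32)
print(algo.n_updates)  # 4 gradient updates for 32 collected steps
```
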
.. image:: ../_static/img/sb3_loop.png

Where to start?
===============
The first things you need to read and understand are the base classes in the ``common/`` folder:

- ``BaseAlgorithm`` in ``base_class.py``, which defines how an RL algorithm class should look.
  It also contains all the "glue code" for saving/loading and the common operations (e.g. wrapping environments).

- ``BasePolicy`` in ``policies.py``, which defines how a policy class should look.
  It also contains all the magic for the ``.predict()`` method, to handle as many spaces/cases as possible.

- ``OffPolicyAlgorithm`` in ``off_policy_algorithm.py``, which contains the implementation of ``collect_rollouts()`` for the off-policy algorithms,
  and similarly ``OnPolicyAlgorithm`` in ``on_policy_algorithm.py``.

All the environments handled internally are assumed to be ``VecEnv`` (``gym.Env`` environments are automatically wrapped).
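The idea behind this auto-wrapping can be sketched as follows (the classes below are simplified stand-ins, not SB3's real ``VecEnv``/``DummyVecEnv``): a single environment gets wrapped so the rest of the code can always assume the vectorized, batched interface.

```python
# Illustrative sketch of auto-wrapping a single env into a vectorized one.
class SingleEnv:
    def reset(self):
        return 0.0  # a single (scalar) observation


class SketchVecEnv:
    """Wraps a list of envs and returns batched observations."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]


def maybe_wrap(env):
    # If the env is not already vectorized, wrap it in a one-env VecEnv,
    # so downstream code only ever deals with batches of observations.
    if not isinstance(env, SketchVecEnv):
        env = SketchVecEnv([env])
    return env


vec_env = maybe_wrap(SingleEnv())
print(vec_env.reset())  # [0.0] -- observations are always batched
```
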
Pre-Processing
==============
To handle different observation spaces, some pre-processing needs to be done (e.g. one-hot encoding for discrete observations).
Most of the code for pre-processing is in ``common/preprocessing.py`` and ``common/policies.py``.

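For instance, the one-hot encoding of a discrete observation can be sketched like this (SB3 does this with PyTorch tensors; plain lists are used here for clarity):

```python
# Illustrative sketch of one-hot pre-processing for a discrete observation.
def one_hot(obs: int, n: int) -> list:
    """Encode a discrete observation in [0, n) as a one-hot vector."""
    vec = [0.0] * n
    vec[obs] = 1.0
    return vec


print(one_hot(2, 4))  # [0.0, 0.0, 1.0, 0.0]
```
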
For images, the environment is automatically wrapped with ``VecTransposeImage`` if observations are detected to be images with
the channel-last convention, in order to transform them to PyTorch's channel-first convention.

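The transposition itself boils down to reordering the axes from (height, width, channels) to (channels, height, width). A minimal sketch with nested lists (the real wrapper operates on NumPy arrays):

```python
# Illustrative sketch of the channel-last -> channel-first transposition.
def hwc_to_chw(image):
    """Transpose a nested-list image from (H, W, C) to (C, H, W)."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    return [
        [[image[h][w][c] for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]


# A 1x2 "image" with 3 channels (e.g. RGB values per pixel).
img = [[[1, 2, 3], [4, 5, 6]]]
print(hwc_to_chw(img))  # [[[1, 4]], [[2, 5]], [[3, 6]]]
```
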
Policy Structure
================
When we refer to "policy" in Stable-Baselines3, this is usually an abuse of language compared to RL terminology.
In SB3, "policy" refers to the class that handles all the networks useful for training,
so not only the network used to predict actions (the "learned controller").
For instance, the ``TD3`` policy contains the actor, the critic and the target networks.
To avoid the hassle of importing specific policy classes for a specific algorithm (e.g. both A2C and PPO use ``ActorCriticPolicy``),
SB3 uses names like "MlpPolicy" and "CnnPolicy" to refer to policies using small feed-forward networks or convolutional networks,
respectively. Importing ``[algorithm]/policies.py`` registers an appropriate policy for that algorithm under those names.
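The registration idea can be sketched as a simple name-to-class mapping (names and functions below are made up for illustration; this is not SB3's actual registration code):

```python
# Illustrative sketch of registering policies under string names.
policy_registry = {}


def register_policy(name, policy_class):
    """Map a user-facing string name to a concrete policy class."""
    policy_registry[name] = policy_class


class SketchMlpPolicy:
    pass


# Importing an algorithm's policies.py would run a registration like this:
register_policy("MlpPolicy", SketchMlpPolicy)

# The algorithm can then resolve the string the user passed in:
policy_class = policy_registry["MlpPolicy"]
print(policy_class.__name__)  # SketchMlpPolicy
```
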
Probability distributions
=========================
When needed, the policies handle the different probability distributions.
All distributions are located in ``common/distributions.py`` and follow the same interface.
Each distribution corresponds to a type of action space (e.g. ``Categorical`` is the one used for discrete actions).
For continuous actions, we can use multiple distributions ("DiagGaussian", "SquashedGaussian" or "StateDependentDistribution").
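As a sketch of what "same interface" means, here is a toy categorical distribution exposing ``sample()`` and ``log_prob()`` (SB3's versions are built on ``torch.distributions``; the class below is a stand-in):

```python
# Illustrative sketch of a distribution for discrete actions,
# following a common sample()/log_prob() interface.
import math
import random


class SketchCategorical:
    """Categorical distribution over len(probs) discrete actions."""

    def __init__(self, probs):
        self.probs = probs

    def sample(self):
        # Draw one action index according to the probabilities.
        return random.choices(range(len(self.probs)), weights=self.probs)[0]

    def log_prob(self, action):
        return math.log(self.probs[action])


dist = SketchCategorical([0.1, 0.7, 0.2])
action = dist.sample()
print(action in (0, 1, 2))         # True
print(round(dist.log_prob(1), 4))  # -0.3567 (log of 0.7)
```
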
State-Dependent Exploration
===========================
State-Dependent Exploration (SDE) is a type of exploration that allows using RL directly on real robots;
it was the starting point for the Stable-Baselines3 library.
I (@araffin) published a paper about a generalized version of SDE (the one implemented in SB3): https://arxiv.org/abs/2005.05719
Misc
====
The rest of ``common/`` is composed of helpers (e.g. evaluation helpers) or basic components (like the callbacks).
The ``type_aliases.py`` file contains common type hint aliases like ``GymStepReturn``.

Et voilà?
=========
After reading this guide and the mentioned files, you should now be able to understand the design logic behind the library ;)