- PPO is now closer to the original implementation (no clipping of the value function by default), cf PPO section below
- Orthogonal initialization is only used by A2C/PPO
- The features extractor (CNN extractor) is shared between policy and q-networks for DDPG/SAC/TD3 and only the policy loss used to update it (much faster)
- Tensorboard legacy logging was dropped in favor of having one logger for the terminal and Tensorboard (cf :ref:`Tensorboard integration <tensorboard>`)
- We dropped ACKTR/ACER support because of their complexity compared to simpler alternatives (PPO, SAC, TD3) performing as good.
- We dropped GAIL support as we are focusing on model-free RL only, you can however take a look at the :ref:`imitation project <imitation>` which implements
GAIL and other imitation learning algorithms on top of SB3.
``nminibatches`` gave different batch size depending on the number of environments: ``batch_size = (n_steps * n_envs) // nminibatches``
-``clip_range_vf`` behavior for PPO is slightly different: Set it to ``None`` (default) to deactivate clipping (in SB2, you had to pass ``-1``, ``None`` meant to use ``clip_range`` for the clipping)
-``lam`` -> ``gae_lambda``
-``noptepochs`` -> ``n_epochs``
PPO default hyperparameters are the one tuned for continuous control environment.
We recommend taking a look at the :ref:`RL Zoo <rl_zoo>` for hyperparameters tuned for Atari games.
DQN
^^^
Only the vanilla DQN is implemented right now but extensions will follow.
Default hyperparameters are taken from the nature paper, except for the optimizer and learning rate that were taken from Stable Baselines defaults.
DDPG
^^^^
DDPG now follows the same interface as SAC/TD3.
For state/reward normalization, you should use ``VecNormalize`` as for all other algorithms.
SAC/TD3
^^^^^^^
SAC/TD3 now accept any number of critics, e.g. ``policy_kwargs=dict(n_critics=3)``, instead of only two before.
..note::
SAC/TD3 default hyperparameters (including network architecture) now match the ones from the original papers.
DDPG is using TD3 defaults.
SAC
^^^
SAC implementation matches the latest version of the original implementation: it uses two Q function networks and two target Q function networks
instead of two Q function networks and one Value function network (SB2 implementation, first version of the original implementation).
Despite this change, no change in performance should be expected.
..note::
SAC ``predict()`` method has now ``deterministic=False`` by default for consistency.
To match SB2 behavior, you need to explicitly pass ``deterministic=True``
Please read the :ref:`Developer Guide <developer>` section.
New Features (SB3 vs SB2)
=========================
- Much cleaner and consistent base code (and no more warnings =D!) and static type checks
- Independent saving/loading/predict for policies
- A2C now supports Generalized Advantage Estimation (GAE) and advantage normalization (both are deactivated by default)
- Generalized State-Dependent Exploration (gSDE) exploration is available for A2C/PPO/SAC. It allows to use RL directly on real robots (cf https://arxiv.org/abs/2005.05719)
- Proper evaluation (using separate env) is included in the base class (using ``EvalCallback``),
if you pass the environment as a string, you can pass ``create_eval_env=True`` to the algorithm constructor.
- Better saving/loading: optimizers are now included in the saved parameters and there is two new methods ``save_replay_buffer`` and ``load_replay_buffer`` for the replay buffer when using off-policy algorithms (DQN/DDPG/SAC/TD3)
- You can pass ``optimizer_class`` and ``optimizer_kwargs`` to ``policy_kwargs`` in order to easily
customize optimizers
- Seeding now works properly to have deterministic results
- Replay buffer does not grow, allocate everything at build time (faster)
- We added a memory efficient replay buffer variant (pass ``optimize_memory_usage=True`` to the constructor), it reduces drastically the memory used especially when using images
- You can specify an arbitrary number of critics for SAC/TD3 (e.g. ``policy_kwargs=dict(n_critics=3)``)