stable-baselines3/docs/modules/sac.rst
Antonin Raffin b4dc9d4e4d Add doc
2019-09-26 11:46:40 +02:00


.. _sac:

.. automodule:: torchy_baselines.sac

SAC
===

`Soft Actor Critic (SAC) <https://spinningup.openai.com/en/latest/algorithms/sac.html>`_: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.

SAC is the successor of `Soft Q-Learning (SQL) <https://arxiv.org/abs/1702.08165>`_ and incorporates the double Q-learning trick from TD3.
A key feature of SAC, and a major difference from common RL algorithms, is that it is trained to maximize a trade-off between expected return and entropy, a measure of randomness in the policy.

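
The entropy term and the double Q-learning trick both show up in the target used to train the Q-functions.
The snippet below is a minimal sketch in plain Python, not the code used by this package: the function
``soft_td_target`` and all of its arguments are hypothetical names chosen for illustration, and it handles
a single transition with scalar values.

.. code-block:: python

  def soft_td_target(reward, done, next_q1, next_q2, next_log_prob,
                     gamma=0.99, ent_coef=0.2):
      """Entropy-regularized TD target for one transition (illustrative only)."""
      # Double Q-learning trick: use the smaller of the two estimates
      # to limit overestimation of the next state's value.
      min_next_q = min(next_q1, next_q2)
      # Entropy bonus: subtracting ent_coef * log pi(a'|s') rewards
      # actions that the policy considers less likely.
      soft_value = min_next_q - ent_coef * next_log_prob
      # No bootstrapping from the next state when the episode has ended.
      return reward + gamma * (1.0 - float(done)) * soft_value

Setting ``ent_coef`` to zero in this sketch recovers the usual clipped double-Q target, which makes the role
of the entropy term easy to see in isolation.
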

.. warning::

  The SAC model does not support ``torchy_baselines.common.policies`` because it uses double Q-values
  and value estimation. As a result, it must use its own policy models (see :ref:`sac_policies`).


.. rubric:: Available Policies

.. autosummary::
    :nosignatures:

    MlpPolicy


Notes
-----

- Original paper: https://arxiv.org/abs/1801.01290
- OpenAI Spinning Up guide for SAC: https://spinningup.openai.com/en/latest/algorithms/sac.html
- Original implementation: https://github.com/haarnoja/sac
- Blog post on using SAC with real robots: https://bair.berkeley.edu/blog/2018/12/14/sac/

.. note::

    In our implementation, we use an entropy coefficient (as in OpenAI Spinning Up or Facebook Horizon),
    which is equivalent to the inverse of the reward scale in the original SAC paper.
    The main reason is that it avoids overly large errors when updating the Q-functions.

.. note::

    The default policy for SAC differs slightly from the other MlpPolicy classes: it uses ReLU instead of tanh activation
    to match the original paper.

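
The relation between the entropy coefficient and the reward scale can be checked on a toy problem.
The sketch below assumes a one-step problem with three discrete actions (a simplification, since SAC itself
targets continuous actions), where the policy maximizing expected reward plus weighted entropy is a softmax
over ``reward_scale * r / ent_coef``. The helper names ``softmax`` and ``max_entropy_policy`` are hypothetical
and not part of this package.

.. code-block:: python

  import math

  def softmax(values):
      # Subtract the maximum for numerical stability.
      m = max(values)
      exps = [math.exp(v - m) for v in values]
      total = sum(exps)
      return [e / total for e in exps]

  def max_entropy_policy(rewards, ent_coef=1.0, reward_scale=1.0):
      """Maximizer of E[reward_scale * r] + ent_coef * entropy (one-step case)."""
      return softmax([reward_scale * r / ent_coef for r in rewards])

  rewards = [1.0, 2.0, 0.5]
  # Original-paper style: scale the rewards, keep the entropy weight at 1.
  scaled = max_entropy_policy(rewards, ent_coef=1.0, reward_scale=5.0)
  # Entropy-coefficient style: keep the rewards, use ent_coef = 1 / reward_scale.
  with_coef = max_entropy_policy(rewards, ent_coef=0.2, reward_scale=1.0)

Both calls yield the same action probabilities, while the second keeps the rewards, and therefore the
Q-values, at their original magnitude.
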

Can I use?
----------

- Recurrent policies: ❌
- Multi processing: ❌
- Gym spaces:


============= ====== ===========
Space         Action Observation
============= ====== ===========
Discrete      ❌      ❌
Box           ✔️      ✔️
MultiDiscrete ❌      ❌
MultiBinary   ❌      ❌
============= ====== ===========


Example
-------

.. code-block:: python

  import gym
  import numpy as np

  from torchy_baselines.sac.policies import MlpPolicy
  from torchy_baselines.common.vec_env import DummyVecEnv
  from torchy_baselines import SAC

  env = gym.make('Pendulum-v0')
  env = DummyVecEnv([lambda: env])

  model = SAC(MlpPolicy, env, verbose=1)
  model.learn(total_timesteps=50000, log_interval=10)
  model.save("sac_pendulum")

  del model  # remove to demonstrate saving and loading

  model = SAC.load("sac_pendulum")

  obs = env.reset()
  while True:
      action, _states = model.predict(obs)
      obs, rewards, dones, info = env.step(action)
      env.render()


Parameters
----------

.. autoclass:: SAC
  :members:
  :inherited-members:

.. _sac_policies:

SAC Policies
-------------

.. autoclass:: MlpPolicy
  :members:
  :inherited-members: