
.. _ppo2:

.. automodule:: stable_baselines3.ppo

PPO
===

The `Proximal Policy Optimization <https://arxiv.org/abs/1707.06347>`_ algorithm combines ideas from A2C (having multiple workers)
and TRPO (it uses a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy.
To achieve this, PPO uses clipping to avoid too large an update.
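
As a rough illustration of the clipping idea, the sketch below computes the clipped surrogate objective
described in the original paper for a single sample. The function name ``clipped_surrogate`` and its
argument names are hypothetical and chosen only for this example; they are not the library's implementation.
Here ``ratio`` stands for the probability ratio between the new and the old policy for the sampled action.

.. code-block:: python

  def clipped_surrogate(ratio, advantage, clip_range=0.2):
      """Per-sample clipped surrogate objective (to be maximized).

      Hypothetical helper for illustration only.
      """
      # Keep the probability ratio inside [1 - clip_range, 1 + clip_range]
      clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
      # Take the more pessimistic of the unclipped and clipped terms
      return min(ratio * advantage, clipped_ratio * advantage)


  # Positive advantage: pushing the ratio beyond 1 + clip_range brings no extra gain
  gain_far = clipped_surrogate(ratio=1.5, advantage=1.0)
  gain_edge = clipped_surrogate(ratio=1.2, advantage=1.0)

  # Negative advantage: a ratio that grew too large is still fully penalized
  penalty = clipped_surrogate(ratio=1.5, advantage=-1.0)

With a positive advantage, a ratio of 1.5 yields the same objective as a ratio of 1.2, so the optimizer
has no incentive to move the policy further away from the old one.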


.. note::

  PPO contains several modifications from the original algorithm not documented
  by OpenAI: advantages are normalized and the value function can also be clipped.
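
The two modifications mentioned in the note can be sketched in plain Python as follows. This is a minimal
illustration under stated assumptions, not the library's code: the helper names ``normalize_advantages``
and ``clip_value_prediction``, the ``eps`` constant, and the use of the population standard deviation are
all choices made for this example.

.. code-block:: python

  import statistics


  def normalize_advantages(advantages, eps=1e-8):
      """Rescale a batch of advantages to zero mean and roughly unit variance.

      Hypothetical helper; ``eps`` only guards against division by zero.
      """
      mean = statistics.mean(advantages)
      std = statistics.pstdev(advantages)
      return [(a - mean) / (std + eps) for a in advantages]


  def clip_value_prediction(new_value, old_value, clip_range_vf):
      """Limit how far a new value estimate may move from the old one.

      Hypothetical helper; ``clip_range_vf`` is the maximum allowed change.
      """
      delta = new_value - old_value
      clipped_delta = max(-clip_range_vf, min(delta, clip_range_vf))
      return old_value + clipped_delta


  normalized = normalize_advantages([1.0, 2.0, 3.0])
  clipped = clip_value_prediction(new_value=2.0, old_value=1.0, clip_range_vf=0.2)

In this sketch, the value estimate is allowed to move by at most ``clip_range_vf`` from the estimate
recorded when the data was collected, which mirrors the clipping applied to the policy.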


Notes
-----

- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html


Can I use?
----------

-  Recurrent policies: ❌
-  Multi processing: ✔️
-  Gym spaces:


============= ====== ===========
Space         Action Observation
============= ====== ===========
Discrete      ❌      ❌
Box           ✔️      ✔️
MultiDiscrete ❌      ❌
MultiBinary   ❌      ❌
============= ====== ===========

Example
-------

Train a PPO agent on ``CartPole-v1`` using 4 environments.

.. code-block:: python

  import gym

  from stable_baselines3 import PPO
  from stable_baselines3.ppo import MlpPolicy
  from stable_baselines3.common.cmd_utils import make_vec_env

  # Parallel environments
  env = make_vec_env('CartPole-v1', n_envs=4)

  # Train the agent and save it to disk
  model = PPO(MlpPolicy, env, verbose=1)
  model.learn(total_timesteps=25000)
  model.save("ppo_cartpole")

  del model # remove to demonstrate saving and loading

  model = PPO.load("ppo_cartpole")

  # Run the trained agent (loops until interrupted)
  obs = env.reset()
  while True:
      action, _states = model.predict(obs)
      obs, rewards, dones, info = env.step(action)
      env.render()

Parameters
----------

.. autoclass:: PPO
  :members:
  :inherited-members: