.. _ppo2:

.. automodule:: stable_baselines3.ppo


PPO
===

The `Proximal Policy Optimization <https://arxiv.org/abs/1707.06347>`_ algorithm combines ideas from A2C (having multiple workers)
and TRPO (it uses a trust region to improve the actor).

The main idea is that after an update, the new policy should not be too far from the old policy.
To enforce this, PPO clips the policy update so that a single step cannot move the policy too far.

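
The clipping described above can be illustrated with a minimal, pure-Python sketch of the clipped surrogate objective from the original paper, evaluated for a single sample. The function name ``clipped_surrogate`` and the default ``clip_range=0.2`` are assumptions chosen for illustration; this is not the library's implementation.

.. code-block:: python

  import math


  def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
      """Per-sample clipped surrogate objective (a quantity to maximize).

      Hypothetical helper for illustration only.
      """
      # Probability ratio between the new and the old policy
      ratio = math.exp(log_prob_new - log_prob_old)
      # Keep the ratio inside [1 - clip_range, 1 + clip_range]
      clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
      # Pessimistic bound: take the smaller of the two candidate objectives
      return min(ratio * advantage, clipped_ratio * advantage)


  # Positive advantage, ratio 1.5: the gain is capped at (1 + clip_range) * advantage
  capped_gain = clipped_surrogate(math.log(1.5), 0.0, advantage=2.0)
  # Negative advantage, ratio 1.5: no clipping, the full penalty is kept
  full_penalty = clipped_surrogate(math.log(1.5), 0.0, advantage=-2.0)

Because of the outer ``min``, the objective stops rewarding the policy once the ratio leaves the clipping interval in the direction that would improve it, which removes the incentive for overly large updates.
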

.. note::

  PPO contains several modifications from the original algorithm not documented
  by OpenAI: advantages are normalized and the value function can also be clipped.

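
The two modifications mentioned in the note can be sketched in plain Python as follows. The helper names ``normalize_advantages`` and ``clip_value_prediction``, the ``eps`` constant, the use of the population standard deviation, and the default ``clip_range_vf=0.2`` are all assumptions made for this illustration; the library's own code may differ in these details.

.. code-block:: python

  import math


  def normalize_advantages(advantages, eps=1e-8):
      """Rescale a batch of advantages to zero mean and (roughly) unit variance.

      Hypothetical helper; uses the population standard deviation.
      """
      mean = sum(advantages) / len(advantages)
      variance = sum((a - mean) ** 2 for a in advantages) / len(advantages)
      std = math.sqrt(variance)
      # eps avoids a division by zero when all advantages are equal
      return [(a - mean) / (std + eps) for a in advantages]


  def clip_value_prediction(value_new, value_old, clip_range_vf=0.2):
      """Limit how far the new value estimate may move from the old one.

      Hypothetical helper; the clipped prediction would then be used
      in the value loss instead of the raw prediction.
      """
      delta = value_new - value_old
      return value_old + max(-clip_range_vf, min(delta, clip_range_vf))


  normalized = normalize_advantages([1.0, 2.0, 3.0])
  clipped_value = clip_value_prediction(value_new=2.0, value_old=1.0)

Normalizing the advantages keeps the scale of the policy gradient comparable across batches, while clipping the value prediction applies the same "stay close to the previous iterate" idea to the critic.
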

Notes
-----

- Original paper: https://arxiv.org/abs/1707.06347
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html


Can I use?
----------

- Recurrent policies: ❌
- Multi processing: ✔️
- Gym spaces:


============= ====== ===========
Space         Action Observation
============= ====== ===========
Discrete      ❌      ❌
Box           ✔️      ✔️
MultiDiscrete ❌      ❌
MultiBinary   ❌      ❌
============= ====== ===========


Example
-------

Train a PPO agent on ``CartPole-v1`` using 4 environments.

.. code-block:: python

  import gym

  from stable_baselines3 import PPO
  from stable_baselines3.ppo import MlpPolicy
  from stable_baselines3.common.cmd_utils import make_vec_env

  # Parallel environments
  env = make_vec_env('CartPole-v1', n_envs=4)

  # Train the agent, then save it to disk
  model = PPO(MlpPolicy, env, verbose=1)
  model.learn(total_timesteps=25000)
  model.save("ppo_cartpole")

  del model # remove to demonstrate saving and loading

  model = PPO.load("ppo_cartpole")

  # Run the trained agent; this loop runs until interrupted
  obs = env.reset()
  while True:
      action, _states = model.predict(obs)
      obs, rewards, dones, info = env.step(action)
      env.render()

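
To get a feel for what ``total_timesteps=25000`` with 4 environments means in terms of updates, the following back-of-the-envelope sketch may help. It assumes that each update consumes a rollout of ``n_steps`` transitions per environment, that training only stops at a rollout boundary, and that the values ``n_steps=2048``, ``batch_size=64`` and ``n_epochs=10`` apply; these numbers are assumptions for illustration and should be checked against the parameters documented for ``PPO``. The helper ``ppo_update_budget`` is hypothetical and not part of the library.

.. code-block:: python

  import math


  def ppo_update_budget(total_timesteps, n_envs, n_steps=2048, batch_size=64, n_epochs=10):
      """Estimate rollouts and gradient steps for a training run.

      Hypothetical helper; the default values are assumptions.
      """
      # Transitions collected before each policy update
      rollout_size = n_envs * n_steps
      # Assumes collection continues until the budget is reached or exceeded
      n_rollouts = math.ceil(total_timesteps / rollout_size)
      minibatches_per_epoch = math.ceil(rollout_size / batch_size)
      return {
          "rollout_size": rollout_size,
          "n_rollouts": n_rollouts,
          "collected_timesteps": n_rollouts * rollout_size,
          "gradient_steps": n_rollouts * n_epochs * minibatches_per_epoch,
      }


  budget = ppo_update_budget(total_timesteps=25000, n_envs=4)
  print(budget)

Under these assumptions the run collects slightly more than the requested number of timesteps, because the last rollout is completed before training stops.
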

Parameters
----------

.. autoclass:: PPO
  :members:
  :inherited-members: