Update doc: PPO blog post and remark on timeouts (#896)

Antonin RAFFIN 2022-05-01 16:26:34 +02:00 committed by GitHub
parent a6f5049a99
commit c5f0aa5de0
3 changed files with 19 additions and 3 deletions

@ -183,6 +183,16 @@ Some basic advice:
- start with a shaped reward (i.e. an informative reward) and a simplified version of your problem
- debug with random actions to check that your environment works and follows the gym interface:
Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption
and to properly handle termination due to a timeout (maximum number of steps in an episode).
For instance, if there is a time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
as input.
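
As an illustration of the history-of-observations remark, here is a minimal sketch in plain Python. ``ObservationHistory`` is a hypothetical helper (not part of gym or Stable-Baselines3) and it assumes that observations are flat lists of floats; it only shows the idea of feeding the last ``n_frames`` observations to the agent.

```python
from collections import deque


class ObservationHistory:
    """Hypothetical helper: keeps the last `n_frames` observations."""

    def __init__(self, n_frames):
        self.n_frames = n_frames
        # Oldest observations are dropped automatically once full
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_obs):
        # At the start of an episode, pad the history with the first observation
        self.frames.clear()
        for _ in range(self.n_frames):
            self.frames.append(list(first_obs))
        return self.get()

    def add(self, obs):
        self.frames.append(list(obs))
        return self.get()

    def get(self):
        # Concatenate the stored observations, oldest first
        return [value for frame in self.frames for value in frame]


history = ObservationHistory(n_frames=3)
stacked = history.reset([0.0, 1.0])
stacked = history.add([2.0, 3.0])
```

In practice an existing frame-stacking wrapper (for instance ``VecFrameStack`` in Stable-Baselines3) may already cover this need.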
Termination due to timeout (max number of steps per episode) needs to be handled separately. You should set the corresponding key in the info dict: ``info["TimeLimit.truncated"] = True``.
If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
You can read `Time Limit in RL <https://arxiv.org/abs/1712.00378>`_ or take a look at the `RL Tips and Tricks video <https://www.youtube.com/watch?v=Ikngt0_DXJg>`_
for more details.
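
A minimal sketch of the timeout remark above, assuming the old gym ``step()`` convention that returns ``(obs, reward, done, info)``. ``CountingEnv``, ``max_episode_steps`` and ``goal`` are hypothetical names used for illustration only, and the class deliberately does not inherit from ``gym.Env`` so that the snippet stays self-contained.

```python
class CountingEnv:
    """Hypothetical toy environment: the agent must reach `goal` by adding actions."""

    def __init__(self, max_episode_steps=5, goal=100):
        self.max_episode_steps = max_episode_steps
        self.goal = goal
        self.position = 0
        self.elapsed_steps = 0

    def reset(self):
        self.position = 0
        self.elapsed_steps = 0
        return self.position

    def step(self, action):
        self.position += action
        self.elapsed_steps += 1

        # True termination: the task itself is over
        terminated = self.position >= self.goal
        # Timeout: the episode is cut short, the task is not over
        timed_out = self.elapsed_steps >= self.max_episode_steps

        info = {}
        if timed_out and not terminated:
            # Tell the algorithm that the episode ended because of the time limit
            info["TimeLimit.truncated"] = True

        reward = 1.0 if terminated else 0.0
        done = terminated or timed_out
        return self.position, reward, done, info


env = CountingEnv(max_episode_steps=3, goal=10)
env.reset()
for _ in range(3):
    obs, reward, done, info = env.step(1)
```

After the third step, ``done`` is ``True`` and ``info`` contains ``"TimeLimit.truncated"``, because the goal was not reached before the step limit.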
We provide a helper to check that your environment runs without error:
@ -241,12 +251,15 @@ We *recommend following those steps to have a working RL algorithm*:
1. Read the original paper several times
2. Read existing implementations (if available)
3. Try to have some "sign of life" on toy problems
4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo)
You usually need to run hyperparameter optimization for that step.
4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo).
You usually need to run hyperparameter optimization for that step.
You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
You need to be particularly careful with the shapes of the different objects you are manipulating (a broadcast mistake will fail silently, cf. `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
and when to stop the gradient propagation.
Don't forget to handle termination due to timeout separately (see remark in the custom environment section above).
You can also take a look at `Issue #284 <https://github.com/DLR-RM/stable-baselines3/issues/284>`_ and `Issue #633 <https://github.com/DLR-RM/stable-baselines3/issues/633>`_.
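
To make the two warnings above concrete, here is a small NumPy sketch. All array values are made up for illustration, and the target computation is one common way to treat timeouts (bootstrapping from the next value when the episode was only truncated), not necessarily how a given library implements it.

```python
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 2.0])              # shape (3,)
next_values = np.array([[0.5], [0.2], [0.1]])    # shape (3, 1), e.g. raw critic output

# Silent broadcast mistake: (3,) + (3, 1) gives a (3, 3) matrix, no error is raised
wrong_targets = rewards + gamma * next_values

# Fix: make the shapes match before combining the arrays
next_values_flat = next_values.squeeze(-1)       # shape (3,)

# Episode-end flags: the second transition ended because of a timeout only
dones = np.array([0.0, 1.0, 1.0])
timeouts = np.array([0.0, 1.0, 0.0])

# Only true terminations should stop bootstrapping
terminated = dones * (1.0 - timeouts)
targets = rewards + gamma * (1.0 - terminated) * next_values_flat
```

Checking ``wrong_targets.shape`` against ``targets.shape`` (for instance with an ``assert``) is a cheap way to catch this kind of mistake early.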
A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:
1. Pendulum (easy to solve)

@ -37,6 +37,8 @@ Documentation:
^^^^^^^^^^^^^^
- Added link to gym doc and gym env checker
- Fix typo in PPO doc (@bcollazo)
- Added link to PPO ICLR blog post
- Added remark about breaking Markov assumption and timeout handling
Release 1.5.0 (2022-03-25)

@ -25,6 +25,7 @@ Notes
- Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html
- 37 implementation details blog: https://ppo-details.cleanrl.dev//2021/11/05/ppo-implementation-details/
Can I use?