Update doc: PPO blog post and remark on timeouts (#896)

2026-07-09 17:29:20 +00:00 · 2022-05-01 16:26:34 +02:00 · 2022-05-01 16:26:34 +02:00 · c5f0aa5de0
commit c5f0aa5de0
parent a6f5049a99
3 changed files with 19 additions and 3 deletions
--- a/docs/guide/rl_tips.rst
+++ b/docs/guide/rl_tips.rst
@@ -183,6 +183,16 @@ Some basic advice:
 - start with shaped reward (i.e. informative reward) and simplified version of your problem
 - debug with random actions to check that your environment works and follows the gym interface:

+Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption
+and to properly handle termination due to a timeout (maximum number of steps in an episode).
+For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
+as input.
+
+Termination due to timeout (max number of steps per episode) needs to be handled separately. You should set the corresponding key in the info dict: ``info["TimeLimit.truncated"] = True``.
+If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
+You can read `Time Limit in RL <https://arxiv.org/abs/1712.00378>`_ or take a look at the `RL Tips and Tricks video <https://www.youtube.com/watch?v=Ikngt0_DXJg>`_
+for more details.
+

 We provide a helper to check that your environment runs without error:

@@ -241,12 +251,15 @@ We *recommend following those steps to have a working RL algorithm*:
 1. Read the original paper several times
 2. Read existing implementations (if available)
 3. Try to have some "sign of life" on toy problems
-4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo)
-	You usually need to run hyperparameter optimization for that step.
+4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo).
+   You usually need to run hyperparameter optimization for that step.

-You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
+You need to be particularly careful with the shape of the different objects you are manipulating (a broadcast mistake will fail silently, cf. `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
 and when to stop the gradient propagation.

+Don't forget to handle termination due to timeout separately (see remark in the custom environment section above).
+You can also take a look at `Issue #284 <https://github.com/DLR-RM/stable-baselines3/issues/284>`_ and `Issue #633 <https://github.com/DLR-RM/stable-baselines3/issues/633>`_.
+
 A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:

 1. Pendulum (easy to solve)
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@@ -37,6 +37,8 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Added link to gym doc and gym env checker
 - Fix typo in PPO doc (@bcollazo)
+- Added link to PPO ICLR blog post
+- Added remark about breaking Markov assumption and timeout handling


 Release 1.5.0 (2022-03-25)
--- a/docs/modules/ppo.rst
+++ b/docs/modules/ppo.rst
@@ -25,6 +25,7 @@ Notes
 - Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
 - OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
 - Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html
+- 37 implementation details blog: https://ppo-details.cleanrl.dev//2021/11/05/ppo-implementation-details/


 Can I use?