From c5f0aa5de0a1a8a8b226665cfe45ccb09df353bc Mon Sep 17 00:00:00 2001
From: Antonin RAFFIN <antonin.raffin@ensta.org>
Date: Sun, 1 May 2022 16:26:34 +0200
Subject: [PATCH] Update doc: PPO blog post and remark on timeouts (#896)

---
 docs/guide/rl_tips.rst  | 19 ++++++++++++++++---
 docs/misc/changelog.rst |  2 ++
 docs/modules/ppo.rst    |  1 +
 3 files changed, 19 insertions(+), 3 deletions(-)
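
Note for reviewers (not part of the commit): the timeout remark added to
rl_tips.rst below can be illustrated with a minimal sketch. It assumes the
older gym step API that returns a 4-tuple (obs, reward, done, info).
ToyTimeoutEnv and its max_episode_steps and goal parameters are hypothetical
names used only for illustration, and the class deliberately does not depend
on gym.

```python
from typing import Any, Dict, Tuple


class ToyTimeoutEnv:
    """Hypothetical gym-style environment (4-tuple step API, no gym import)."""

    def __init__(self, max_episode_steps: int = 5, goal: int = 100) -> None:
        self.max_episode_steps = max_episode_steps
        self.goal = goal
        self.position = 0
        self.elapsed_steps = 0

    def reset(self) -> int:
        self.position = 0
        self.elapsed_steps = 0
        return self.position

    def step(self, action: int) -> Tuple[int, float, bool, Dict[str, Any]]:
        self.position += action
        self.elapsed_steps += 1

        # True termination: the task itself is over.
        terminated = self.position >= self.goal
        # Timeout: the step budget is exhausted.
        timed_out = self.elapsed_steps >= self.max_episode_steps

        info: Dict[str, Any] = {}
        if timed_out and not terminated:
            # Flag the timeout so it can be told apart from a real terminal state.
            info["TimeLimit.truncated"] = True

        reward = 1.0 if terminated else 0.0
        done = terminated or timed_out
        return self.position, reward, done, info


# Episode that ends because of the step limit, not because the goal is reached.
env = ToyTimeoutEnv(max_episode_steps=3)
obs = env.reset()
done, info = False, {}
while not done:
    obs, reward, done, info = env.step(1)
```

The flag is only set when the episode has not already ended for a task-related
reason, so a learning algorithm can still bootstrap from the last observation
of a timed-out episode.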

diff --git a/docs/guide/rl_tips.rst b/docs/guide/rl_tips.rst
index 031f947..2f093f6 100644
--- a/docs/guide/rl_tips.rst
+++ b/docs/guide/rl_tips.rst
@@ -183,6 +183,16 @@ Some basic advice:
 - start with shaped reward (i.e. informative reward) and simplified version of your problem
 - debug with random actions to check that your environment works and follows the gym interface:
 
+Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption
+and to properly handle termination due to a timeout (maximum number of steps in an episode).
+For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
+as input.
+
+Termination due to a timeout (max number of steps per episode) needs to be handled separately from a true terminal state. You should set the corresponding key in the info dict: ``info["TimeLimit.truncated"] = True``.
+If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
+You can read `Time Limits in Reinforcement Learning <https://arxiv.org/abs/1712.00378>`_ or take a look at the `RL Tips and Tricks video <https://www.youtube.com/watch?v=Ikngt0_DXJg>`_
+for more details.
+
 
 We provide a helper to check that your environment runs without error:
 
@@ -241,12 +251,15 @@ We *recommend following those steps to have a working RL algorithm*:
 1. Read the original paper several times
 2. Read existing implementations (if available)
 3. Try to have some "sign of life" on toy problems
-4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo)
-	You usually need to run hyperparameter optimization for that step.
+4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo).
+   You usually need to run hyperparameter optimization for that step.
 
-You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
+You need to be particularly careful with the shapes of the different objects you are manipulating (a broadcast mistake will fail silently, cf. `issue #75 <https://github.com/hill-a/stable-baselines/pull/76>`_)
 and when to stop the gradient propagation.
 
+Don't forget to handle termination due to a timeout separately (see the remark in the custom environment section above).
+You can also take a look at `Issue #284 <https://github.com/DLR-RM/stable-baselines3/issues/284>`_ and `Issue #633 <https://github.com/DLR-RM/stable-baselines3/issues/633>`_.
+
 A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:
 
 1. Pendulum (easy to solve)
diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst
index 409b672..42a1d5a 100644
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@@ -37,6 +37,8 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Added link to gym doc and gym env checker
 - Fix typo in PPO doc (@bcollazo)
+- Added link to PPO ICLR blog post
+- Added remark about breaking the Markov assumption and timeout handling
 
 
 Release 1.5.0 (2022-03-25)
diff --git a/docs/modules/ppo.rst b/docs/modules/ppo.rst
index 3aab653..d32986c 100644
--- a/docs/modules/ppo.rst
+++ b/docs/modules/ppo.rst
@@ -25,6 +25,7 @@ Notes
 - Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
 - OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
 - Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html
+- The 37 implementation details of PPO (blog post): https://ppo-details.cleanrl.dev//2021/11/05/ppo-implementation-details/
 
 
 Can I use?