From c5f0aa5de0a1a8a8b226665cfe45ccb09df353bc Mon Sep 17 00:00:00 2001
From: Antonin RAFFIN
Date: Sun, 1 May 2022 16:26:34 +0200
Subject: [PATCH] Update doc: PPO blog post and remark on timeouts (#896)

---
 docs/guide/rl_tips.rst  | 19 ++++++++++++++++---
 docs/misc/changelog.rst |  2 ++
 docs/modules/ppo.rst    |  1 +
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/docs/guide/rl_tips.rst b/docs/guide/rl_tips.rst
index 031f947..2f093f6 100644
--- a/docs/guide/rl_tips.rst
+++ b/docs/guide/rl_tips.rst
@@ -183,6 +183,16 @@ Some basic advice:
 - start with shaped reward (i.e. informative reward) and simplified version of your problem
 - debug with random actions to check that your environment works and follows the gym interface:

+Two important things to keep in mind when creating a custom environment are to avoid breaking the Markov assumption
+and to properly handle termination due to a timeout (maximum number of steps in an episode).
+For instance, if there is some time delay between action and observation (e.g. due to wifi communication), you should give a history of observations
+as input.
+
+Termination due to timeout (max number of steps per episode) needs to be handled separately. You should set the corresponding key in the info dict: ``info["TimeLimit.truncated"] = True``.
+If you are using the gym ``TimeLimit`` wrapper, this will be done automatically.
+You can read `Time Limit in RL `_ or take a look at the `RL Tips and Tricks video `_
+for more details.
+

 We provide a helper to check that your environment runs without error:

@@ -241,12 +251,15 @@ We *recommend following those steps to have a working RL algorithm*:
 1. Read the original paper several times
 2. Read existing implementations (if available)
 3. Try to have some "sign of life" on toy problems
-4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo)
-   You usually need to run hyperparameter optimization for that step.
+4. Validate the implementation by making it run on harder and harder envs (you can compare results against the RL zoo).
+   You usually need to run hyperparameter optimization for that step.

-You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf `issue #75 `_)
+You need to be particularly careful on the shape of the different objects you are manipulating (a broadcast mistake will fail silently cf. `issue #75 `_)
 and when to stop the gradient propagation.

+Don't forget to handle termination due to timeout separately (see remark in the custom environment section above).
+You can also take a look at `Issue #284 `_ and `Issue #633 `_.
+
 A personal pick (by @araffin) for environments with gradual difficulty in RL with continuous actions:

 1. Pendulum (easy to solve)
diff --git a/docs/misc/changelog.rst b/docs/misc/changelog.rst
index 409b672..42a1d5a 100644
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@@ -37,6 +37,8 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Added link to gym doc and gym env checker
 - Fix typo in PPO doc (@bcollazo)
+- Added link to PPO ICLR blog post
+- Added remark about breaking Markov assumption and timeout handling


 Release 1.5.0 (2022-03-25)
diff --git a/docs/modules/ppo.rst b/docs/modules/ppo.rst
index 3aab653..d32986c 100644
--- a/docs/modules/ppo.rst
+++ b/docs/modules/ppo.rst
@@ -25,6 +25,7 @@ Notes
 - Clear explanation of PPO on Arxiv Insights channel: https://www.youtube.com/watch?v=5P7I-xPq8u8
 - OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
 - Spinning Up guide: https://spinningup.openai.com/en/latest/algorithms/ppo.html
+- 37 implementation details blog: https://ppo-details.cleanrl.dev//2021/11/05/ppo-implementation-details/

 Can I use?
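
Note on the timeout remark added to ``rl_tips.rst``: the sketch below illustrates one way the ``info["TimeLimit.truncated"]`` flag might be set by a custom environment and then used when computing a bootstrapped target. It is only an illustration under stated assumptions, not code from this patch or from the library: ``ToyEnv``, ``td_target`` and ``GAMMA`` are hypothetical names, the environment mimics the gym 4-tuple ``step`` interface without importing gym, and the one-step TD target stands in for whatever target a given algorithm actually uses.

```python
from typing import Any, Dict, Tuple

GAMMA = 0.99  # hypothetical discount factor, chosen for illustration


class ToyEnv:
    """Hypothetical gym-style environment whose episodes end only by timeout."""

    def __init__(self, max_episode_steps: int = 5) -> None:
        self.max_episode_steps = max_episode_steps
        self.n_steps = 0

    def reset(self) -> float:
        self.n_steps = 0
        return 0.0

    def step(self, action: int) -> Tuple[float, float, bool, Dict[str, Any]]:
        self.n_steps += 1
        obs, reward = float(self.n_steps), 1.0
        terminated = False  # this toy task has no true terminal state
        timeout = self.n_steps >= self.max_episode_steps
        info: Dict[str, Any] = {}
        if timeout and not terminated:
            # Mark the episode end as a time limit, not a real termination
            info["TimeLimit.truncated"] = True
        return obs, reward, terminated or timeout, info


def td_target(reward: float, next_value: float, done: bool, info: Dict[str, Any]) -> float:
    """One-step TD target that keeps bootstrapping when the episode timed out."""
    timeout = info.get("TimeLimit.truncated", False)
    if done and not timeout:
        # True termination: no future reward after this state
        return reward
    # Episode still running, or cut short by the time limit: bootstrap
    return reward + GAMMA * next_value
```

With this convention, a transition that ends because of the step limit is treated like any non-terminal transition when building the target, while a true terminal state contributes only its immediate reward.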