Add note about SAC ent coeff optimization (#2037)

* Allow new sphinx version
* Add note about SAC ent coeff and add DQN tutorial link
2024-11-08 11:01:04 +01:00
commit e4f4f123e3
parent 8f0b488bc5
5 changed files with 8 additions and 2 deletions
--- a/docs/conda_env.yml
+++ b/docs/conda_env.yml
@@ -14,6 +14,6 @@ dependencies:
    - pandas
    - numpy>=1.20,<2.0
    - matplotlib
-    - sphinx>=5,<8
+    - sphinx>=5,<9
    - sphinx_rtd_theme>=1.3.0
    - sphinx_copybutton
--- a/docs/misc/changelog.rst
+++ b/docs/misc/changelog.rst
@@ -59,6 +59,7 @@ Bug Fixes:
 `SBX`_ (SB3 + Jax)
 ^^^^^^^^^^^^^^^^^^
 - Added CNN support for DQN
+- Bug fix for SAC and related algorithms, optimize log of ent coeff to be consistent with SB3

 Deprecations:
 ^^^^^^^^^^^^^
@@ -80,6 +81,7 @@ Documentation:
 ^^^^^^^^^^^^^^
 - Updated PPO doc to recommend using CPU with ``MlpPolicy``
 - Clarified documentation about planned features and citing software
+- Added a note about the fact we are optimizing log of ent coeff for SAC

 Release 2.3.2 (2024-04-27)
 --------------------------
--- a/docs/modules/dqn.rst
+++ b/docs/modules/dqn.rst
@@ -25,6 +25,7 @@ Notes

 - Original paper: https://arxiv.org/abs/1312.5602
 - Further reference: https://www.nature.com/articles/nature14236
+- Tutorial "From Tabular Q-Learning to DQN": https://github.com/araffin/rlss23-dqn-tutorial

 .. note::
    This implementation provides only vanilla Deep Q-Learning and has no extensions such as Double-DQN, Dueling-DQN and Prioritized Experience Replay.
--- a/docs/modules/sac.rst
+++ b/docs/modules/sac.rst
@@ -35,6 +35,9 @@ Notes
    which is the equivalent to the inverse of reward scale in the original SAC paper.
    The main reason is that it avoids having too high errors when updating the Q functions.

+.. note::
+    When automatically adjusting the temperature (alpha/entropy coefficient), we optimize the logarithm of the entropy coefficient instead of the entropy coefficient itself. This is consistent with the original implementation and has proven to be more stable
+    (see issues `GH#36 <https://github.com/DLR-RM/stable-baselines3/issues/36>`_, `#55 <https://github.com/araffin/sbx/issues/55>`_ and others).

 .. note::

--- a/setup.py
+++ b/setup.py
@@ -101,7 +101,7 @@ setup(
            "black>=24.2.0,<25",
        ],
        "docs": [
-            "sphinx>=5,<8",
+            "sphinx>=5,<9",
            "sphinx-autobuild",
            "sphinx-rtd-theme>=1.3.0",
            # For spelling
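
Illustration of the note added to docs/modules/sac.rst: a minimal sketch, not SB3 or SBX code, of why optimizing the logarithm of the entropy coefficient can be more stable. It assumes a temperature loss of the form ``-(x * (log_prob + target_entropy)).mean()``, plain gradient descent instead of an adaptive optimizer, and a hypothetical batch of log-probabilities; all names and numeric values below are illustrative.

```python
import math


def ent_coef_gradient(log_probs, target_entropy):
    """Gradient of -(x * (log_prob + target_entropy)).mean() with respect to x."""
    mean_log_prob = sum(log_probs) / len(log_probs)
    return -(mean_log_prob + target_entropy)


# Hypothetical batch: the entropy estimate (-mean(log_prob) = 2.0) is above
# the target (-1.0), so the coefficient should shrink.
log_probs = [-2.0, -2.0, -2.0, -2.0]
target_entropy = -1.0
learning_rate = 0.1  # illustrative value, plain gradient descent (assumption)

grad = ent_coef_gradient(log_probs, target_entropy)

ent_coef_direct = 0.5         # variant 1: optimize the coefficient itself
log_ent_coef = math.log(0.5)  # variant 2: optimize its logarithm

for _ in range(2):
    ent_coef_direct -= learning_rate * grad
    log_ent_coef -= learning_rate * grad

ent_coef_from_log = math.exp(log_ent_coef)
print(ent_coef_direct < 0)    # True: the raw coefficient became negative
print(ent_coef_from_log > 0)  # True: exp() keeps the coefficient positive
```

With the same gradient and step size, the directly optimized coefficient crosses zero after two steps, while the log-parameterized one only shrinks multiplicatively and stays positive by construction.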