According to the Soft Actor-Critic paper by Haarnoja et al. (2018), the TD target, the policy update, and the entropy coefficient (temperature) update all use the log probability inside the expectation, which follows from the soft value function; entropy enters only indirectly through that term. I want to know whether the MATLAB R2021a documentation, which uses entropy directly, contains a documentation error, or whether there is a coding error in the backend. Since I can't find the training-loop functions for these updates, I cannot verify this myself. I will put the formulas for comparison here as images, as the text might exceed the character limit.
In the Spinning Up in RL documentation from OpenAI (the parent company of the authors, who tweaked the algorithm slightly to use only Q-values), only the log probability is used, which is then summed over.
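For reference, here is a minimal sketch of the Spinning Up style TD target, where the entropy bonus enters only through the sampled `-alpha * log_pi` term rather than an analytic entropy. All names (`sac_td_target` and its arguments) are my own illustration, not actual MATLAB or Spinning Up code:

```python
import numpy as np

def sac_td_target(rewards, next_q1, next_q2, next_logp,
                  alpha=0.2, gamma=0.99, done=None):
    """Spinning Up style SAC target (illustrative sketch):
        y = r + gamma * (1 - d) * (min(Q1, Q2) - alpha * log_pi(a'|s'))
    The entropy appears only indirectly: -alpha * log_pi is a
    single-sample estimate of the entropy inside the expectation,
    evaluated at an action sampled from the current policy.
    """
    rewards = np.asarray(rewards, dtype=float)
    if done is None:
        done = np.zeros_like(rewards)
    # Clipped double-Q: take the minimum of the two critics.
    min_q = np.minimum(next_q1, next_q2)
    return rewards + gamma * (1.0 - done) * (min_q - alpha * next_logp)

# Example: r=1, Q1=2, Q2=3, log_pi=-1, alpha=0.2, gamma=0.99, not done
y = sac_td_target(np.array([1.0]), np.array([2.0]),
                  np.array([3.0]), np.array([-1.0]))
print(y)  # 1 + 0.99 * (2 - 0.2 * (-1)) = 3.178
```

This is what I would expect the backend to compute if it follows the paper; the question is whether MATLAB replaces the `-alpha * log_pi` sample with an entropy term before the summation.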
In MATLAB's documentation, however, entropy appears directly, before the summation.
SAC is an important off-policy reinforcement learning algorithm for many research purposes, valued for its sample efficiency. If the mistake in the documentation is reflected in the code, the objective would contain a term of higher degree than the entropy, producing inaccurate results; and even if it is only a minor documentation error, it still needs to be fixed.