Understanding Entropy Loss for PPO Agents Exploration

Question

Mike Jadwin, 10 Oct 2023

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/2031574-understanding-entropy-loss-for-ppo-agents-exploration

Commented: Mohammed Mohiuddin, 15 Apr 2024

Hello,

I have been experimenting with training a PPO agent on a continuous action space. I am a little confused about how exploration works when using entropy loss. I have mostly used epsilon-greedy exploration in the past, which seems easier to understand in terms of how the agent explores (it takes random actions with probability epsilon, and the epsilon decay is easy to calculate knowing the decay rate). This means I know exactly the number of training iterations after which the agent should start relying on the trained policy instead of exploring. I'm not able to understand how the entropy term controls exploration in the same sense.
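The epsilon-greedy calculation described above can be sketched in Python. This assumes a multiplicative decay, epsilon <- epsilon * (1 - decay), applied once per step until a floor is reached; the function name and the example numbers are illustrative and are not taken from the post.

```python
import math


def steps_until_floor(eps_start, eps_min, decay):
    """Number of decay steps before epsilon first drops to eps_min or below.

    Assumes epsilon is multiplied by (1 - decay) once per step.
    """
    if not (0.0 < decay < 1.0):
        raise ValueError("decay must be strictly between 0 and 1")
    if not (0.0 < eps_min <= eps_start):
        raise ValueError("need 0 < eps_min <= eps_start")
    # Solve eps_start * (1 - decay)**n <= eps_min for the smallest integer n.
    return math.ceil(math.log(eps_min / eps_start) / math.log(1.0 - decay))


# Illustrative values only: start at 1.0, floor at 0.01, decay rate 0.005.
n_steps = steps_until_floor(1.0, 0.01, 0.005)
print(n_steps)
```

With epsilon-greedy this step count is known before training starts, which is the predictability the question is asking about.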

Answer 1

Emmanouil Tzorakoleftherakis, 11 Oct 2023

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/2031574-understanding-entropy-loss-for-ppo-agents-exploration#answer_1331054

Hi,

In PPO, training strikes a balance between the entropy term and fine-tuning the probabilities of all available actions. This balancing happens throughout training because, unlike with an epsilon-greedy approach, exploration in PPO is not annealed on a schedule and so does not diminish over time. This page and the references therein should be helpful.
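To make the role of the entropy term concrete, here is a minimal Python sketch. It assumes a continuous-action policy modeled as a diagonal Gaussian and an actor objective of the form "surrogate plus weight times entropy"; the function names and numbers are hypothetical, and this is not the toolbox implementation. In Reinforcement Learning Toolbox the weight plays the role of `EntropyLossWeight` in `rlPPOAgentOptions`.

```python
import math


def gaussian_entropy(stds):
    """Differential entropy of a diagonal Gaussian with the given standard deviations."""
    return sum(0.5 * math.log(2.0 * math.pi * math.e * s * s) for s in stds)


def actor_objective(surrogate, entropy, entropy_weight):
    """Quantity the actor maximizes in this sketch: surrogate term plus entropy bonus."""
    return surrogate + entropy_weight * entropy


# A wider action distribution has higher entropy, so it earns a larger bonus.
narrow = gaussian_entropy([0.1, 0.1])
wide = gaussian_entropy([1.0, 1.0])
print(narrow, wide)

# Same (made-up) surrogate value, different spreads: the entropy weight
# decides how much the wider policy is favored.
print(actor_objective(1.0, narrow, 0.01), actor_objective(1.0, wide, 0.01))
```

The entropy depends only on the standard deviations, so the bonus pushes against the policy collapsing to a very narrow distribution. How much exploration remains at any point depends on how that push trades off against the surrogate term, not on an iteration count.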

Also, don't forget that the PPO policy is stochastic, so there is always some exploration happening when the action distribution is sampled. If, after training, you want to use just the action mean (i.e., not sample to get the policy output), you can set this option to 0.
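The difference between sampling the action distribution and using its mean can be sketched as follows. This is a conceptual Python illustration for a scalar Gaussian policy; `select_action` and its `explore` flag are hypothetical names, not the toolbox option referred to above.

```python
import random


def select_action(mean, std, explore=True, rng=random):
    """Return a sampled action when exploring, otherwise the distribution mean."""
    if explore:
        # Stochastic policy: draw from N(mean, std^2).
        return rng.gauss(mean, std)
    # Deterministic use after training: take the mean, no sampling.
    return mean


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(select_action(0.5, 0.2, explore=True, rng=rng))
print(select_action(0.5, 0.2, explore=False))
```

With `explore=False` the output is the same on every call for a given observation, which is usually what is wanted when deploying or evaluating the trained policy.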

Hope this helps

Mike Jadwin, 29 Feb 2024

Yeah, I don't think what I tried was ideal, but here's what I did: set a specific number of training epochs you want to complete for each learning rate. For example, you can start with high entropy for maybe 1000 epochs, then take that trained agent and use it to initialize a new agent whose training parameters have a lower entropy term. It's not ideal because every time you kick off a new training session it opens a new training history window, so depending on how many times you do this it can get pretty cluttered. This is especially true if you want a linear decay where the entropy changes frequently; I had to just turn off the plotter so it doesn't refresh every time. It might be worth finding another agent, since there is no built-in way to anneal the exploration in the default PPO agent. It seems to be designed to explore throughout the entire training time, which to me seemed to result in unstable or suboptimal results.
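The staged workaround described above can be sketched in Python as a loop over entropy weights. This is a framework-agnostic outline: `train_stage` is a hypothetical callback standing in for "set the entropy weight on the agent, train for a fixed number of epochs, return the trained agent", and the linear schedule is only one possible choice.

```python
def linear_entropy_schedule(start, end, num_stages):
    """Entropy weights decaying linearly from start to end over num_stages stages."""
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    if num_stages == 1:
        return [start]
    step = (end - start) / (num_stages - 1)
    return [start + i * step for i in range(num_stages)]


def staged_training(agent, train_stage, weights, epochs_per_stage):
    """Train in stages, each starting from the agent produced by the previous stage."""
    for weight in weights:
        # train_stage is assumed to apply the weight, train, and return the agent.
        agent = train_stage(agent, weight, epochs_per_stage)
    return agent


# Illustrative schedule: five stages from a weight of 0.1 down to 0.0.
weights = linear_entropy_schedule(0.1, 0.0, 5)
print(weights)
```

Each stage corresponds to one separate training session in the description above, which is why more frequent weight changes mean more sessions and more training windows.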

Mohammed Mohiuddin, 15 Apr 2024

Thank you for your suggestion. I tried this approach and it seemed to work but, like you said, it is not a very efficient approach.
