Some of the saved agents in a DQN reinforcement learning algorithm do not reproduce the training rewards

Hello everyone,
I am training a DQN agent for my reinforcement learning problem in Simulink (MATLAB R2019a). I have set a criterion that saves any agent whose episode reward is 6 or greater. After a number of episodes, some agents are saved according to this criterion, but when I use them, some of the saved agents do not reproduce a reward of 6 or more; instead the reward is less than 6. I would appreciate any help.
  2 Comments
Madhav Thakker, 24 Sep 2020 (edited 24 Sep 2020)
Hi masoud,
Do you save the agent as soon as you get an episode with a reward of 6? Are you using an epsilon-greedy policy to train the agents? How much lower are the rewards when you use the saved agent?
masoud k, 24 Sep 2020
Hi Madhav,
The criterion for saving candidate agents is an episode reward of 6 or more.
I use an epsilon-greedy policy to train the agents, with epsilon equal to 0.6 and a decay rate of 0.001.
Some of the saved agents produce an episode reward of around 1, while others produce 6 or more.
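For reference, a setup like the one described above would look roughly like this in code (a minimal sketch; rlDQNAgentOptions, rlTrainingOptions, and the EpsilonGreedyExploration / SaveAgentCriteria properties are standard Reinforcement Learning Toolbox names, but the EpsilonMin value and the folder name are assumptions, not the actual setup):

% Exploration settings described above (sketch, not the actual setup)
agentOpts = rlDQNAgentOptions;
agentOpts.EpsilonGreedyExploration.Epsilon      = 0.6;    % initial exploration rate
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.001;  % per-step decay
agentOpts.EpsilonGreedyExploration.EpsilonMin   = 0.01;   % exploration floor (assumed value)

% Save any agent whose episode reward reaches 6 or more
trainOpts = rlTrainingOptions( ...
    'SaveAgentCriteria',  'EpisodeReward', ...
    'SaveAgentValue',     6, ...
    'SaveAgentDirectory', 'savedAgents');   % hypothetical folder name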


Accepted Answer

Emmanouil Tzorakoleftherakis, 24 Sep 2020
This is a common misconception. When you train an agent and get a certain reward from a specific episode, if you were to stop training and run the same episode again with the same agent, you would most likely get a different response from the agent and thus a different episode reward.
There are a few reasons for that:
1) During training, the agent explores different options based on some probabilistic exploration strategy. For DQN this strategy is governed by the epsilon parameter in the DQN agent options. After training, the agent no longer explores and relies only on the underlying neural network for inference, so it makes sense that the results are different.
2) It is best practice to randomize various elements of the environment during training to obtain a more robust policy. In that case there will also be differences in how the environment behaves during and after training, so the agent will respond differently as well.
3) Some agents are stochastic. This by itself implies different decisions/behavior even if everything else in the environment remains deterministic.
Hope that helps
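A quick way to see what a saved candidate agent actually achieves without exploration is to reload it and run a plain simulation. A minimal sketch, assuming the saved MAT file contains a variable named saved_agent (the name train typically uses when saving candidate agents) and that env is the same Simulink environment used for training; the file name and MaxSteps value are hypothetical:

% Re-evaluate a saved candidate agent with the greedy (no-exploration) policy
data  = load('savedAgents/Agent250.mat');            % hypothetical saved-agent file
agent = data.saved_agent;

simOpts    = rlSimulationOptions('MaxSteps', 500);    % match the training episode length (assumed)
experience = sim(env, agent, simOpts);
episodeReward = sum(experience.Reward.Data)           % compare with the reward logged at save time

If this number is consistently below the reward logged during training, that gap is exactly the exploration and randomization effect described in points 1) and 2) above.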
  3 Comments
Emmanouil Tzorakoleftherakis, 24 Sep 2020
Typically, exploration slows down over time and the agent starts exploiting more. For example, you can see how epsilon decays for DQN here. If you start getting consistently good rewards at a later stage of training, when exploration is low, that is a good sign that the behavior will carry over after training. You may still not get exactly the same rewards (especially if the minimum epsilon value is nonzero), but you should get desirable performance overall.
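To get a feel for how fast exploration fades with the settings mentioned in this thread (epsilon 0.6, decay 0.001), here is a rough back-of-the-envelope check; it assumes the decay is applied per agent step as epsilon = epsilon*(1 - EpsilonDecay) down to EpsilonMin, taken here as 0.01:

% Approximate epsilon after k agent steps (assumed decay rule)
epsilon0 = 0.6;  decay = 0.001;  epsMin = 0.01;
k     = 0:1000:5000;
eps_k = max(epsMin, epsilon0*(1 - decay).^k);
disp([k; eps_k])   % epsilon is already near its floor after roughly 4000 steps

So unless each episode is very short, an agent saved late in training is acting almost greedily, and its saved episode reward should be much closer to what you see when you deploy it.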
Abdul Basith Ashraf, 3 Apr 2021
Are you saying that we should use an agent from a higher episode number?


More Answers (0)