High fluctuation in Q0 value for TD3 agent while training.

6 views (last 30 days)
James Sorokhaibam on 12 May 2024
Answered: Ronit on 23 May 2024
I am training a TD3 RL agent for a pick-and-place robot. The reward function is reward = exp(-E/d), where E is the total energy consumed when the trajectory is complete and d is the distance of the object from the end-effector. Training went smoothly with a DQN agent, but it fails when DDPG or TD3 agents are used. What could be the reason for this? I used the following code for agent creation.
% Observation and action specifications
obsInfo = rlNumericSpec([34 1]);
actInfo = rlNumericSpec([14 1], ...
    LowerLimit=-1, ...
    UpperLimit= 1);
% Custom environment defined by step and reset functions
env = rlFunctionEnv(obsInfo,actInfo,"KondoStepFunction","KondoResetFunction");
% TD3 agent with default networks and default agent options
agent = rlTD3Agent(obsInfo,actInfo);

Answers (1)

Ronit on 23 May 2024
Hello James,
To understand why there are high fluctuations with different RL agents, we first need to understand how these agents work.
  • The primary difference between DQN and agents like DDPG and TD3 is that DQN is just a value-based learning method, whereas DDPG and TD3 use the actor-critic method.
  • The DQN network tries to predict the Q values for each state-action pair, so it is just a single model. On the other hand, DDPG has a critic model that determines the Q value but uses the actor model to determine the action to take. Hence, we can say DDPG tries to directly learn the policy whereas DQN learns the Q values which are used to define the policy, generally an epsilon-greedy policy.
  • So, training an agent with DDPG or TD3 must be done more carefully, not only because its learning is sometimes unstable, but also because the number of hyperparameters to fine-tune is roughly double that of DQN (a short sketch of this actor/critic structure follows this list).
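As an illustration, here is a minimal sketch (assuming the default TD3 agent created with the code in your question) that extracts the actor and the twin critics the agent trains in parallel; getActor, getCritic and getModel are standard Reinforcement Learning Toolbox functions:
actor   = getActor(agent);       % deterministic policy network (the actor)
critics = getCritic(agent);      % TD3 keeps two Q-value critics (twin critics)
actorNet   = getModel(actor);    % underlying networks, e.g. for inspection
critic1Net = getModel(critics(1));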
Here are a few suggestions which can help in getting good results using TD3 or DDPG agents:
  1. Tune Hyperparameters: Adjust learning rates, replay buffer size, and exploration noise.
  2. Normalize Rewards: Consider scaling your reward to reduce variability and improve learning stability.
  3. Monitor Training: Use diagnostics to understand action, reward, and learning dynamics better.
Adjusting these aspects can help mitigate the high fluctuation and improve your TD3 agent's training performance; a sketch covering points 1–3 is shown below.
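Here is a minimal sketch of points 1–3. The option names are from rlTD3AgentOptions and rlTrainingOptions (the optimizer-options properties assume R2022a or newer), and the numeric values are only illustrative; they will need tuning for your robot:
% 1. Tune hyperparameters via the agent options instead of using defaults
agentOpts = rlTD3AgentOptions( ...
    MiniBatchSize=128, ...
    ExperienceBufferLength=1e6, ...
    DiscountFactor=0.99);
agentOpts.ExplorationModel.StandardDeviation = 0.1;    % exploration noise
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
for k = 1:numel(agentOpts.CriticOptimizerOptions)      % TD3 has two critics
    agentOpts.CriticOptimizerOptions(k).LearnRate = 1e-3;
end
agent = rlTD3Agent(obsInfo,actInfo,agentOpts);

% 2. Normalize rewards: reward = exp(-E/d) already lies in (0, 1], but if it
%    saturates near 0 for typical trajectories, consider rescaling E and d
%    inside KondoStepFunction so the reward signal stays informative.

% 3. Monitor training with the Episode Manager and the returned statistics
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=2000, ...
    MaxStepsPerEpisode=500, ...
    Plots="training-progress", ...
    Verbose=true);
trainStats = train(agent,env,trainOpts);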
Hope this helps!
