Reinforcement Learning DQN Training Convergence Problem

5 views (last 30 days)
Gülin Sayal on 6 Jun 2021
Answered: Darshak on 29 Apr 2025
Hi everyone,
I am designing an energy management system for a vehicle and using a DQN agent to optimize fuel consumption. Here are the relevant lines from my code:
% Create the Simulink environment and get the network input/output sizes
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
nI = obsInfo.Dimension(1);      % number of observations
nL = 24;                        % neurons per hidden layer
nO = numel(actInfo.Elements);   % number of discrete actions

% Critic network
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(nL,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(nL,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(nO,'Name','output')];

criticOpts = rlRepresentationOptions('LearnRate',0.00025,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);

% Agent options
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false, ...
    'TargetUpdateMethod',"periodic", ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0050;

agentObj = rlDQNAgent(critic,agentOpts);

% Training options (T and Ts are the simulation stop time and sample time)
maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainingOpts = rlTrainingOptions('MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',0);

trainingStats = train(agentObj,env,trainingOpts);
The problem is that after training, the episode rewards do not converge. Moreover, the long-term estimated cumulative reward Q0 diverges. I have already read some posts on this topic here and normalized my action and observation spaces, which did not help. I also tried adding a scaling layer right before the last fullyConnectedLayer, which did not help either. You can find my training progress curves in the attachment.
What else can I try so that Q0 does not diverge and the episode rewards converge?
Also, I would really like to know how Q0 is calculated. It should not be possible for my model to have such large long-term estimated rewards.
Best Regards,
Gülin

Answers (1)

Darshak on 29 Apr 2025
Hello Gülin Sayal,
I understand that training does not converge and that the long-term estimated reward “Q₀” diverges.
“Q₀” is the initial state-action value estimate computed from the target critic network at the beginning of each episode (typically at time step t=0).
Mathematically, for a state “s₀”, the agent computes:
Q₀ = max_a Q_target(s₀, a)
Where:
  • Q_target is the target critic network, updated periodically.
  • max_a denotes taking the maximum Q-value over all possible actions at that state.
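If you want to inspect this quantity yourself, you can evaluate the critic at the initial observation of an episode. Here is a minimal sketch, assuming the agent and observation size from your code; obs0 is only a placeholder for your model's real initial observation:
% obs0: hypothetical initial observation (replace with the actual initial
% state of your model, e.g. logged from one simulation run)
obs0 = zeros(nI,1);
critic = getCritic(agentObj);        % critic of the (partially) trained agent
qValues = getValue(critic,{obs0});   % Q(s0,a) for every discrete action
Q0 = max(qValues)                    % roughly what the training plot reports as Q0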
If Q₀ diverges, it indicates instability in the value estimates, which is commonly caused by:
  • High learning rate
  • Poor network structure
  • Unnormalized input/output
  • Unstable reward scale
  • Improper target network update frequency
To resolve diverging Q₀ values and non-converging rewards in DQN training, you may refer to the steps mentioned below:
1. Scale rewards in the environment (e.g., divide by a constant) to keep them within a range like [-1, 1].
reward = rawReward / 100;
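If the reward is computed inside the Simulink model, this scaling can be applied directly on the reward signal (for example with a Gain block). If it comes from a MATLAB Function block, a minimal sketch could look like the following; the reward terms and the factor 100 are placeholders, not your actual model:
function reward = computeReward(fuelRate, socDeviation)
% Hypothetical reward for illustration only: replace the terms and the
% scaling constant with whatever your energy-management model uses.
rawReward = -fuelRate - 0.1*socDeviation;   % example unscaled reward
reward = rawReward/100;                     % keep rewards roughly within [-1, 1]
end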
2. Use Double DQN to reduce Q-value overestimation.
agentOpts.UseDoubleDQN = true;
You can refer to the following documentation for more information on the “rlDQNAgentOptions” function: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rldqnagentoptions.html
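In the context of the options from your question, that could look like the following sketch (the other settings are kept as you had them; the target update method is left at its default):
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true, ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentObj = rlDQNAgent(critic,agentOpts);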
3. Add a tanhLayer after the output to bound Q-values between -1 and 1 (this only makes sense if the scaled returns from step 1 actually fall within that range).
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer
    fullyConnectedLayer(64,'Name','fc2')
    reluLayer
    fullyConnectedLayer(nO,'Name','output')
    tanhLayer('Name','tanhOut')];
You can refer to the following documentation for more information on the “tanhLayer” function: https://www.mathworks.com/help/releases/R2021a/deeplearning/ref/nnet.cnn.layer.tanhlayer.html
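After changing the network, remember to rebuild the critic and the agent so the new layers take effect; a sketch reusing the constructor calls from your own code:
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);
agentObj = rlDQNAgent(critic,agentOpts);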
4. Reduce the critic's learning rate for stable updates.
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);
5. Make target updates less frequent to reduce instability.
agentOpts.TargetUpdateFrequency = 20;
6. Use wider or deeper networks to improve approximation capability (see the sketch after this step for a deeper variant).
% e.g., increase the hidden layer width from nL = 24 to 64:
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
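If you want to go deeper rather than only wider, here is a sketch of a three-hidden-layer critic; the layer sizes are illustrative, not tuned values:
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(64,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(32,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(nO,'Name','output')];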
7. Ensure exploration reduces over time:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
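To check how quickly exploration decays relative to your episode length, you can plot the schedule. According to the rlDQNAgentOptions documentation, epsilon is multiplied by (1 - EpsilonDecay) at every training step until it reaches EpsilonMin:
% Sketch of the epsilon schedule implied by the settings above
eps0 = 1; epsMin = 0.1; epsDecay = 0.01;
nSteps = 2000;                                      % number of steps to visualize
epsilon = max(epsMin, eps0*(1-epsDecay).^(0:nSteps-1));
plot(epsilon), xlabel('Training step'), ylabel('Epsilon')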
I hope this resolves the issue.
