Reinforcement Learning DQN Training Convergence Problem

5 views (last 30 days)
Gülin Sayal on 6 Jun 2021
Answered: Darshak on 29 Apr 2025
Hi everyone,
I am designing an energy management system for a vehicle and using a DQN agent to optimize fuel consumption. Here are the relevant lines from my code:
% Create the Simulink environment and get the network input/output sizes
env = rlSimulinkEnv(mdl,agentblk,obsInfo,actInfo);
nI = obsInfo.Dimension(1);      % number of observations
nL = 24;                        % neurons per hidden layer
nO = numel(actInfo.Elements);   % number of discrete actions

% Critic network
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(nL,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(nL,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(nO,'Name','output')];

criticOpts = rlRepresentationOptions('LearnRate',0.00025,'GradientThreshold',1);
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);

% Agent options
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',false, ...
    'TargetUpdateMethod',"periodic", ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.2;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.0050;

agentObj = rlDQNAgent(critic,agentOpts);

% Training options (T and Ts are the simulation stop time and sample time)
maxepisodes = 10000;
maxsteps = ceil(T/Ts);
trainingOpts = rlTrainingOptions('MaxEpisodes',maxepisodes,...
    'MaxStepsPerEpisode',maxsteps,...
    'Verbose',false,...
    'Plots','training-progress',...
    'StopTrainingCriteria','EpisodeReward',...
    'StopTrainingValue',0);

trainingStats = train(agentObj,env,trainingOpts);
The problem is that after training, the episode rewards do not converge. Moreover, the long-term estimated cumulative reward Q0 diverges. I have already read some posts on this topic here and normalized my action and observation spaces, which did not help. I also tried adding a scaling layer right before the last fullyConnectedLayer, which did not help either. You can find my training progress curves in the attachment.
What else can I try so that Q0 does not diverge and the episode rewards converge?
Also, I would really like to know how Q0 is calculated. It should not be possible for my model to have such large long-term estimated rewards.
Best Regards,
Gülin

Answers (1)

Darshak on 29 Apr 2025
Hello Gülin Sayal,
I understand that training does not converge and that the long-term estimated reward “Q₀” diverges.
“Q₀” is the initial state-action value estimate computed from the target critic network at the beginning of each episode (typically at time step t=0).
Mathematically, for a state “s₀”, the agent computes:
Q₀ = max_a Q_target(s₀, a)
Where:
  • Q_target is the target critic network, updated periodically.
  • max_a denotes taking the maximum Q-value over all possible actions at that state.
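If you want to inspect this quantity yourself, you can evaluate the critic at the initial observation of an episode. Here is a minimal sketch, assuming the agent and observation size from your code; obs0 is only a placeholder for your model's real initial observation:
% obs0: hypothetical initial observation (replace with the actual initial
% state of your model, e.g. logged from one simulation run)
obs0 = zeros(nI,1);
critic = getCritic(agentObj);        % critic of the (partially) trained agent
qValues = getValue(critic,{obs0});   % Q(s0,a) for every discrete action
Q0 = max(qValues)                    % roughly what the training plot reports as Q0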
If Q₀ diverges, it indicates instability in the value estimates, which is commonly caused by:
  • High learning rate
  • Poor network structure
  • Unnormalized input/output
  • Unstable reward scale
  • Improper target network update frequency
To resolve diverging Q₀ values and non-converging rewards in DQN training, you may refer to the steps mentioned below:
1. Scale rewards in the environment (e.g., divide by a constant) to keep them within a range like [-1, 1].
reward = rawReward / 100;
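If the reward is computed inside the Simulink model, this scaling can be applied directly on the reward signal (for example with a Gain block). If it comes from a MATLAB Function block, a minimal sketch could look like the following; the reward terms and the factor 100 are placeholders, not your actual model:
function reward = computeReward(fuelRate, socDeviation)
% Hypothetical reward for illustration only: replace the terms and the
% scaling constant with whatever your energy-management model uses.
rawReward = -fuelRate - 0.1*socDeviation;   % example unscaled reward
reward = rawReward/100;                     % keep rewards roughly within [-1, 1]
end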
2. Use Double DQN to reduce Q-value overestimation.
agentOpts.UseDoubleDQN = true;
You can refer to the following documentation for more information on the “rlDQNAgentOptions” function: https://www.mathworks.com/help/releases/R2021a/reinforcement-learning/ref/rldqnagentoptions.html
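In the context of the options from your question, that could look like the following sketch (the other settings are kept as you had them; the target update method is left at its default):
agentOpts = rlDQNAgentOptions(...
    'UseDoubleDQN',true, ...
    'TargetUpdateFrequency',4, ...
    'ExperienceBufferLength',1000, ...
    'DiscountFactor',0.99, ...
    'MiniBatchSize',32);
agentObj = rlDQNAgent(critic,agentOpts);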
3. Add a tanhLayer after the output to bound Q-values between -1 and 1 (this only makes sense if the scaled returns from step 1 actually fall within that range).
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer
    fullyConnectedLayer(64,'Name','fc2')
    reluLayer
    fullyConnectedLayer(nO,'Name','output')
    tanhLayer('Name','tanhOut')];
You can refer to the following documentation for more information on the “tanhLayer” function: https://www.mathworks.com/help/releases/R2021a/deeplearning/ref/nnet.cnn.layer.tanhlayer.html
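After changing the network, remember to rebuild the critic and the agent so the new layers take effect; a sketch reusing the constructor calls from your own code:
critic = rlQValueRepresentation(dnn,obsInfo,actInfo,'Observation',{'state'},criticOpts);
agentObj = rlDQNAgent(critic,agentOpts);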
4. Reduce the critic's learning rate for stable updates.
criticOpts = rlRepresentationOptions('LearnRate',1e-4,'GradientThreshold',0.5);
5. Make target updates less frequent to reduce instability.
agentOpts.TargetUpdateFrequency = 20;
6. Use wider or deeper networks to improve approximation capability (see the sketch after this step for a deeper variant).
% e.g., increase the hidden layer width from nL = 24 to 64:
fullyConnectedLayer(64)
reluLayer
fullyConnectedLayer(64)
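If you want to go deeper rather than only wider, here is a sketch of a three-hidden-layer critic; the layer sizes are illustrative, not tuned values:
dnn = [
    featureInputLayer(nI,'Name','state','Normalization','none')
    fullyConnectedLayer(64,'Name','fc1')
    reluLayer('Name','relu1')
    fullyConnectedLayer(64,'Name','fc2')
    reluLayer('Name','relu2')
    fullyConnectedLayer(32,'Name','fc3')
    reluLayer('Name','relu3')
    fullyConnectedLayer(nO,'Name','output')];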
7. Ensure exploration reduces over time:
agentOpts.EpsilonGreedyExploration.Epsilon = 1;
agentOpts.EpsilonGreedyExploration.EpsilonMin = 0.1;
agentOpts.EpsilonGreedyExploration.EpsilonDecay = 0.01;
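To check how quickly exploration decays relative to your episode length, you can plot the schedule. According to the rlDQNAgentOptions documentation, epsilon is multiplied by (1 - EpsilonDecay) at every training step until it reaches EpsilonMin:
% Sketch of the epsilon schedule implied by the settings above
eps0 = 1; epsMin = 0.1; epsDecay = 0.01;
nSteps = 2000;                                      % number of steps to visualize
epsilon = max(epsMin, eps0*(1-epsDecay).^(0:nSteps-1));
plot(epsilon), xlabel('Training step'), ylabel('Epsilon')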
I hope this resolves the issue.
