How does Q-learning update the qTable when using Reinforcement Learning Toolbox?

Tracy Shang, 1 May 2021
Edited: Tracy Shang, 4 May 2021
'MaxEpisodes' and 'MaxStepsPerEpisode' are both set to 1.
I ran the following code. After the first episode, Q(4,1) is set to -1.
However, when I ran the "train section" again, both Q(4,1) and Q(4,2) were updated, as shown in the following figure.
In the second episode, action 2 is executed in state 4. Therefore, in my opinion, only Q(4,2) should be updated, to -1.
Why is Q(4,2) set to 0.7441?
Why is Q(4,1) also updated and set to -1.67?
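% Create a 4-by-4 grid world that starts in cell [2,1] and terminates in cell [4,4].
GW = createGridWorld(4,4);
GW.CurrentState = '[2,1]';
GW.TerminalStates = '[4,4]';
nS = numel(GW.States);
nA = numel(GW.Actions);
% Reward is -1 for every transition, except 10 for transitions into the terminal state.
GW.R = -1*ones(nS,nS,nA);
GW.R(:,state2idx(GW,GW.TerminalStates),:) = 10;
env = rlMDPEnv(GW);
% Table-based critic with the learning rate set to 1.
qTable = rlTable(getObservationInfo(env),getActionInfo(env));
critic = rlQValueRepresentation(qTable,getObservationInfo(env),getActionInfo(env));
critic.Options.LearnRate = 1;
% Q-learning agent with epsilon-greedy exploration and a discount factor of 1.
agentOpt = rlQAgentOptions;
agentOpt.EpsilonGreedyExploration.Epsilon = 0.05;
agentOpt.DiscountFactor = 1;
agent = rlQAgent(critic, agentOpt);
env.Model.Viewer.ShowTrace = true;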
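%% train section
% One episode of one step per call, as stated at the top of the question.
opt = rlTrainingOptions(...
    'MaxEpisodes',1,...
    'MaxStepsPerEpisode',1,...
    'Plots',"none");
trainStats = train(agent,env,opt);
% Read the learned Q-table back from the critic.
aa = getLearnableParameters(getCritic(agent));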
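For comparison, here is a minimal sketch of what the textbook tabular Q-learning rule would produce for the two one-step episodes described in the question. It uses plain MATLAB only, not the toolbox. The next states in `transitions` are hypothetical (the question does not state them), and the table is assumed to start at zero.

```matlab
% Textbook tabular Q-learning, independent of Reinforcement Learning Toolbox.
nS = 16; nA = 4;
Q = zeros(nS, nA);          % assumed zero-initialized table
alpha = 1;                  % learning rate, as in critic.Options.LearnRate
gamma = 1;                  % discount factor, as in agentOpt.DiscountFactor

% Each row is [state, action, reward, nextState].
% The nextState values are hypothetical placeholders.
transitions = [4 1 -1 3;    % episode 1: action 1 in state 4
               4 2 -1 4];   % episode 2: action 2 in state 4

for k = 1:size(transitions, 1)
    s     = transitions(k, 1);
    a     = transitions(k, 2);
    r     = transitions(k, 3);
    sNext = transitions(k, 4);
    % Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(sNext,a') - Q(s,a))
    Q(s, a) = Q(s, a) + alpha * (r + gamma * max(Q(sNext, :)) - Q(s, a));
end

disp(Q(4, :));
```

Under these assumptions only the visited entry changes in each episode, so row 4 ends as [-1 -1 0 0]. That is the expectation stated in the question, and it differs from what the toolbox reports.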

Emmanouil Tzorakoleftherakis, 3 May 2021
Can you try
This parameter is nonzero by default and is likely the reason for the discrepancy you are observing.
Tracy Shang, 4 May 2021
Thanks for your answer!
I tried the code you suggested. The result showed no difference.
But you inspired me!
I tried another parameter, as follows. The qTable was updated as shown in the following figure.
critic.Options.OptimizerParameters.GradientDecayFactor = 0;
I tried both parameters by adding the following code, and the qTable was updated as shown in the following figure. At least the question about Q(4,1) is solved.
According to the parameters I set, the equation for calculating the Q-value simplifies as follows.
That is, .
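For reference, the standard tabular Q-learning update, and what it reduces to when the learning rate α and the discount factor γ are both 1 (as set through LearnRate and DiscountFactor above), can be written as below. This is the textbook rule, assumed here to be the one the simplified equation refers to; it is not taken from the toolbox itself.

```latex
\begin{align*}
Q(S,A) &\leftarrow Q(S,A) + \alpha\left[R + \gamma \max_{a} Q(S',a) - Q(S,A)\right] \\
\alpha = 1,\ \gamma = 1:\qquad Q(S,A) &\leftarrow R + \max_{a} Q(S',a)
\end{align*}
```

With R = -1 and max_a Q(S',a) = 0, which holds for a table whose only nonzero entry is Q(4,1) = -1, this rule gives Q(4,2) = -1.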
Why is Q(4,2) set to -1.4139?
critic.Options.OptimizerParameters.GradientDecayFactor =0;
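One possible explanation, offered as a sketch and not as a statement about the toolbox internals: if the critic is updated by an Adam-style optimizer instead of the plain tabular rule, then the gradient decay factor (momentum) and the squared-gradient decay factor both change the step size. The plain-MATLAB example below applies a bias-corrected Adam step to two table entries standing in for Q(4,1) and Q(4,2). The squared-gradient decay of 0.999, the epsilon of 1e-8, the bias correction, and the unit gradients are all assumptions, not values read from the thread.

```matlab
% Bias-corrected Adam applied to two entries: theta(1) ~ Q(4,1), theta(2) ~ Q(4,2).
lr      = 1;        % learning rate, as in critic.Options.LearnRate
beta2   = 0.999;    % assumed squared-gradient decay factor
epsilon = 1e-8;     % assumed numerical offset

% Column t is the loss gradient at step t. For the loss 0.5*(Q - target)^2 with
% Q = 0 and target = -1 the gradient is +1, and only the visited entry is nonzero.
grads = [1 0;
         0 1];

beta1Values = [0.9, 0];   % assumed default momentum vs. GradientDecayFactor = 0
result = zeros(2, numel(beta1Values));

for k = 1:numel(beta1Values)
    beta1 = beta1Values(k);
    theta = zeros(2, 1);  % table entries start at zero
    m = zeros(2, 1);      % first-moment (momentum) estimate
    v = zeros(2, 1);      % second-moment estimate
    for t = 1:size(grads, 2)
        g = grads(:, t);
        m = beta1 * m + (1 - beta1) * g;
        v = beta2 * v + (1 - beta2) * g.^2;
        mHat = m / (1 - beta1^t);   % bias correction
        vHat = v / (1 - beta2^t);
        theta = theta - lr * mHat ./ (sqrt(vHat) + epsilon);
    end
    result(:, k) = theta;
    fprintf('beta1 = %.1f: Q(4,1) = %.4f, Q(4,2) = %.4f\n', beta1, theta(1), theta(2));
end
```

Under these assumptions the sketch gives roughly Q(4,1) = -1.6701 and Q(4,2) = -0.7441 with momentum 0.9, and Q(4,1) = -1 and Q(4,2) = -1.4139 with momentum 0. The magnitudes agree with the 1.67, 0.7441 and 1.4139 reported in this thread, which suggests that momentum is what moves Q(4,1) in the second episode and that the squared-gradient term is what scales the Q(4,2) step to about 1.4139 instead of 1. The positive sign reported for 0.7441 is not reproduced by this sketch.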
Looking forward to your further answer. Thank you very much!


