DDPG has two different policies
古いコメントを表示
Hello,
I'm training a DDPG agent for autonomous driving. The thing is when I save the agent, the policy is way different than expected. I know that when training, an exploration policy is used, and when the agent is saved, it is a greedy policy. However, these two policies are so different. The following graph shows what I mean. In red, the greedy policy, and in blue, the exploration policy. As I see, taking the documentation into consideration, the greedy policy should be the exploration policy without the added noise. Then, why is greedy policy above the exploration policy? In fact, when I run a simulation, If I move down the greedy policy a little (-150) I get better results. I would like to know why this is happening, how the greedy policy is made, why these two are so different and how to solve it.

Here it is my main code:
clear all;clc
rng(6);
epochs = 80; %30
mdl = 'MODELO';
stoptrainingcriteria = "AverageReward";
stoptrainingvalue = 2000000;
load_system(mdl);
numObs = 1;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
ActionInfo = rlNumericSpec([1 1],...
LowerLimit=[1]',...
UpperLimit=[1000]');
ActionInfo.Name = 'alfa';
blk = [mdl,'/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,ActionInfo);
env.ResetFcn = @(in) resetfunction(in, mdl);
initOpts = rlAgentInitializationOptions('NumHiddenUnit',32); %32
agent = rlDDPGAgent(obsInfo, ActionInfo, initOpts);
agent.SampleTime = 1;% -1
agent.AgentOptions.NoiseOptions.MeanAttractionConstant = 1/30;% 1/30
agent.AgentOptions.NoiseOptions.StandardDeviation = 41; % 41
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 0.00001;% 0
agent.AgentOptions.NumStepsToLookAhead = 32; % 32
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
opt = rlTrainingOptions(...
'MaxEpisodes', epochs,...
'MaxStepsPerEpisode', 1000,... % 1000
'StopTrainingCriteria', stoptrainingcriteria,...
'StopTrainingValue', stoptrainingvalue,...
'Verbose', true,...
'Plots', "training-progress");
trainResults = train(agent,env,opt);
generatePolicyFunction(agent);
Here it is the function I use to create the graph:
policy1 = getGreedyPolicy(agent);
policy2 = getExplorationPolicy(agent);
x_values = 0:0.1:120;
actions1 = zeros(length(x_values), 1);
actions2 = zeros(length(x_values), 1);
for i = 1:length(x_values)
actions1(i) = cell2mat(policy1.getAction(x_values(i)));
actions2(i) = cell2mat(policy2.getAction(x_values(i)));
end
hold on
plot(x_values, actions2);
plot(x_values, actions1, 'LineWidth', 2);
hold off
Thanks in advance!
採用された回答
その他の回答 (4 件)
Emmanouil Tzorakoleftherakis
2023 年 5 月 23 日
The comparison plot is not set up correctly. The noisy policy also has a noise state which needs to be propagated after each call. This explains while there is an offset between greedy and exploration policy.
The right way to get the actions and propagate the noise state would be
[action,policy] = getAction(policy,observation)
4 件のコメント
Jorge De la Rosa Padrón
2023 年 5 月 23 日
Emmanouil Tzorakoleftherakis
2023 年 5 月 23 日
Not exactly sure what you mean. Can you share the new plot?
Jorge De la Rosa Padrón
2023 年 5 月 24 日
編集済み: Jorge De la Rosa Padrón
2023 年 5 月 24 日
Emmanouil Tzorakoleftherakis
2023 年 5 月 24 日
I see. That's a separate question with no definitive answer. If you only have a 1-1 map, consider using an even smaller network than the 32 hidden nodes that you have. Maybe some of the bias you observed will be eliminated?
awcii
2023 年 7 月 20 日
0 投票
did you solve the problem ?
can you try it by changing the actir and critic learn rates.
Elena
2025 年 11 月 13 日
0 投票
clear all;clc
rng(6);
epochs = 80; %30
mdl = 'MODELO';
stoptrainingcriteria = "AverageReward";
stoptrainingvalue = 2000000;
load_system(mdl);
numObs = 1;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
ActionInfo = rlNumericSpec([1 1],...
LowerLimit=[1]',...
UpperLimit=[1000]');
ActionInfo.Name = 'alfa';
blk = [mdl,'/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,ActionInfo);
env.ResetFcn = @(in) resetfunction(in, mdl);
initOpts = rlAgentInitializationOptions('NumHiddenUnit',32); %32
agent = rlDDPGAgent(obsInfo, ActionInfo, initOpts);
agent.SampleTime = 1;% -1
agent.AgentOptions.NoiseOptions.MeanAttractionConstant = 1/30;% 1/30
agent.AgentOptions.NoiseOptions.StandardDeviation = 41; % 41
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 0.00001;% 0
agent.AgentOptions.NumStepsToLookAhead = 32; % 32
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
opt = rlTrainingOptions(...
'MaxEpisodes', epochs,...
'MaxStepsPerEpisode', 1000,... % 1000
'StopTrainingCriteria', stoptrainingcriteria,...
'StopTrainingValue', stoptrainingvalue,...
'Verbose', true,...
'Plots', "training-progress");
trainResults = train(agent,env,opt);
generatePolicyFunction(agent);
Here it is the function I use to create the graph:
policy1 = getGreedyPolicy(agent);
policy2 = getExplorationPolicy(agent);
x_values = 0:0.1:120;
actions1 = zeros(length(x_values), 1);
actions2 = zeros(length(x_values), 1);
for i = 1:length(x_values)
actions1(i) = cell2mat(policy1.getAction(x_values(i)));
actions2(i) = cell2mat(policy2.getAction(x_values(i)));
end
hold on
plot(x_values, actions2);
plot(x_values, actions1, 'LineWidth', 2);
hold off
Elena
2025 年 11 月 13 日
clear all;clc
rng(6);
epochs = 80; %30
mdl = 'MODELO';
stoptrainingcriteria = "AverageReward";
stoptrainingvalue = 2000000;
load_system(mdl);
numObs = 1;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = 'observations';
ActionInfo = rlNumericSpec([1 1],...
LowerLimit=[1]',...
UpperLimit=[1000]');
ActionInfo.Name = 'alfa';
blk = [mdl,'/RL Agent'];
env = rlSimulinkEnv(mdl,blk,obsInfo,ActionInfo);
env.ResetFcn = @(in) resetfunction(in, mdl);
initOpts = rlAgentInitializationOptions('NumHiddenUnit',32); %32
agent = rlDDPGAgent(obsInfo, ActionInfo, initOpts);
agent.SampleTime = 1;% -1
agent.AgentOptions.NoiseOptions.MeanAttractionConstant = 1/30;% 1/30
agent.AgentOptions.NoiseOptions.StandardDeviation = 41; % 41
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 0.00001;% 0
agent.AgentOptions.NumStepsToLookAhead = 32; % 32
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
opt = rlTrainingOptions(...
'MaxEpisodes', epochs,...
'MaxStepsPerEpisode', 1000,... % 1000
'StopTrainingCriteria', stoptrainingcriteria,...
'StopTrainingValue', stoptrainingvalue,...
'Verbose', true,...
'Plots', "training-progress");
trainResults = train(agent,env,opt);
generatePolicyFunction(agent);
Here it is the function I use to create the graph:
policy1 = getGreedyPolicy(agent);
policy2 = getExplorationPolicy(agent);
x_values = 0:0.1:120;
actions1 = zeros(length(x_values), 1);
actions2 = zeros(length(x_values), 1);
for i = 1:length(x_values)
actions1(i) = cell2mat(policy1.getAction(x_values(i)));
actions2(i) = cell2mat(policy2.getAction(x_values(i)));
end
hold on
plot(x_values, actions2);
plot(x_values, actions1, 'LineWidth', 2);
hold off
カテゴリ
ヘルプ センター および File Exchange で Reinforcement Learning についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!


