On updating the policy with the sim function and a custom loop

2 views (last 30 days)
shoki kobayashi on 30 Nov 2020
Commented: jiayi on 25 Apr 2023
I'm currently trying to train a PPO agent using the sim function and a custom training loop. However, when I use the sim function, the actor and critic networks don't update properly and the agent keeps repeating the same behavior. How can I get the network updates to work? Is it a bad idea to use sim instead of step in the first place? As far as I can tell, sim is the only way to run a custom loop against a Simulink environment, since step is meant for the pattern where you collect the history of actions, observations, and rewards into a buffer yourself. I want to do this without using the train function.
% PPO without using the train function
clear all
rng(0)
%Construct Environment
mdl = 'cartpole';
open_system(mdl)
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous')
obsInfo = getObservationInfo(env);
numObs = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numAct = actInfo.Dimension(1);
Ts = 0.02;
Tf = 25;
%Create PPO Agent
criticLayerSizes = [128 200];
actorLayerSizes = [128 200];
createNetworkWeights;
criticNetwork = [imageInputLayer([numObs 1 1],'Normalization','none','Name','observations')
fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1')
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2')
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput')
];
criticOpts = rlRepresentationOptions('LearnRate',1e-3);
critic = rlValueRepresentation(criticNetwork,env.getObservationInfo, ...
'Observation',{'observations'},criticOpts);
%ActorNetwork
inPath = [ imageInputLayer([numObs 1 1], 'Normalization','none','Name','observations')
fullyConnectedLayer(numAct,'Name','infc') ]; % numAct-by-1 output
% path layers for the mean value (numAct-by-1 input and output)
% using scalingLayer to scale the range
meanPath = [ tanhLayer('Name','tanh'); % output range: (-1,1)
scalingLayer('Name','scale','Scale',actInfo.UpperLimit) ]; % scaled to the action limits
% path layers for the standard deviation (numAct-by-1 input and output)
% using a softplusLayer to keep it non-negative
sdevPath = softplusLayer('Name','splus');
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
% add layers to network object
net = layerGraph(inPath);
net = addLayers(net,meanPath);
net = addLayers(net,sdevPath);
net = addLayers(net,outLayer);
% connect layers: the mean value path output MUST be connected to the FIRST input of the concatenationLayer
net = connectLayers(net,'infc','tanh/in'); % connect output of inPath to meanPath input
net = connectLayers(net,'infc','splus/in'); % connect output of inPath to sdevPath input
net = connectLayers(net,'scale','mean&sdev/in1'); % connect output of meanPath to the first concatenation input
net = connectLayers(net,'splus','mean&sdev/in2'); % connect output of sdevPath to the second concatenation input
actorOptions = rlRepresentationOptions('LearnRate',1e-3);
Actor = rlStochasticActorRepresentation(net,obsInfo,actInfo,...
'Observation',{'observations'}, actorOptions);
opt = rlPPOAgentOptions('ExperienceHorizon',512,...
'ClipFactor',0.2,...
'EntropyLossWeight',0.02,...
'MiniBatchSize',64,...
'NumEpoch',5,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',Ts,...
'DiscountFactor',0.9995);
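% setLoss registers the custom clipped-PPO loss (actorLossFunction, defined below);
% it is the loss that gradient(Actor,'loss-parameters',...) evaluates during the updates.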
Actor = setLoss(Actor, @actorLossFunction);
agent = rlPPOAgent(Actor,critic,opt);
%prepare train
numEpisodes = 20000;
maxStepsPerEpisode = ceil(Tf/Ts);
discountFactor = 0.995;
aveWindowSize = 100;
trainingTerminationValue = 400;
episodeCumulativeRewardVector = [];
[trainingPlot,lineReward,lineAveReward] = hBuildFigure;
%Start learn
% Enable the training visualization plot.
set(trainingPlot,'Visible','on');
% Train the policy for the maximum number of episodes or until the average
% reward indicates that the policy is sufficiently trained.
for episodeCt = 1:numEpisodes
%sim
simout = sim(agent, env);
% Create training data. Training is performed on batch data whose size
% equals the length of the episode.
%batchSize = min(maxStepsPerEpisode,maxStepsPerEpisode);
batchsize = size(simout.Observation.observations.Data,3);
nextobservationBatch = simout.Observation.observations.Data(:,:,2:batchsize);
actionBatch = simout.Action.Action.Data;
rewardBatch = simout.Reward.Data';
isdonebatch = simout.IsDone.Data';
observationBatch = simout.Observation.observations.Data(:,:,1:batchsize-1);
episoderewardBatch = simout.Reward.Data;
% Compute the discounted future reward.
discountedReturn = zeros(1,batchsize-1);
for t = 1:batchsize-1
G = 0;
for k = t:batchsize-1
G = G + discountFactor ^ (k-t) * rewardBatch(k);
end
discountedReturn(t) = G;
end
%Gather the information needed to learn PPO
Observation{1} = observationBatch; %cellarray
nextobservation{1} = nextobservationBatch;%cellarray
[Advantages, CriticTargets] = computeGeneralizedAdvantage(critic, opt.DiscountFactor, opt.GAEFactor,Observation, nextobservation, rewardBatch,isdonebatch);
Action = actionBatch;
obsDimension{1} = obsInfo.Dimension;
ObsDimsToSlice = cellfun(@(x) numel(x) + 1, obsDimension','UniformOutput',false);
BufferLength = numel(CriticTargets);
%--------------------------------------------------------------------------------------------
OldActionProb = evaluate(Actor, Observation);
OldActionProb = OldActionProb{1};
OldActionProb = evaluate(Actor.SamplingStrategy, OldActionProb, Action);
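% OldActionProb now holds the probability density of each logged action under the
% current (pre-update) policy; it is used as piOld(at|st) in the clipped PPO ratio.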
LossVariable.ClipFactor = opt.ClipFactor;
LossVariable.EntropyLossWeight = opt.EntropyLossWeight;
LossVariable.SamplingStrategy = Actor.SamplingStrategy;
LossVariable.Action = Action;
LossVariable.OldPolicy = OldActionProb;
LossVariable.Advantage = Advantages;
MiniBatchIdx = rl.internal.dataTransformation.getMiniBatchIdx(BufferLength, opt.MiniBatchSize, 1);
for epoch = 1:opt.NumEpoch
for ct = 1:numel(MiniBatchIdx)
% Slice mini batch data
SingleBatchIdx = MiniBatchIdx{ct};
MiniBatchObs = rl.internal.dataTransformation.generalSubref(Observation, SingleBatchIdx, ObsDimsToSlice);
MiniBatchCriticTargets = rl.internal.dataTransformation.generalSubref(CriticTargets, SingleBatchIdx, ndims(CriticTargets));
% REVISIT: support single action channel
LossVariable.Action = rl.internal.dataTransformation.generalSubref(Action, SingleBatchIdx, ndims(Action));
LossVariable.OldPolicy = rl.internal.dataTransformation.generalSubref(OldActionProb, SingleBatchIdx, ndims(OldActionProb));
LossVariable.Advantage = rl.internal.dataTransformation.generalSubref(Advantages, SingleBatchIdx, ndims(Advantages));
% Scale the gradient based on ratio of current minibatch size over specified minibatch size
GradScale = single(numel(SingleBatchIdx)/opt.MiniBatchSize);
GradVal = gradient(critic, 'loss-parameters', MiniBatchObs, MiniBatchCriticTargets);
GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
critic = optimize(critic, GradVal);
% Update Actor
GradVal = gradient(Actor,'loss-parameters',MiniBatchObs,LossVariable);
GradVal = rl.internal.dataTransformation.scaleLearnables(GradVal, GradScale);
Actor = optimize(Actor, GradVal);
end
end
episodeCumulativeReward = sum(episoderewardBatch);
episodeCumulativeRewardVector = cat(2,...
episodeCumulativeRewardVector,episodeCumulativeReward);
movingAveReward = movmean(episodeCumulativeRewardVector,...
aveWindowSize,2);
addpoints(lineReward,episodeCt,episodeCumulativeReward);
addpoints(lineAveReward,episodeCt,movingAveReward(end));
drawnow;
if max(movingAveReward) > trainingTerminationValue
break
end
end
%plot env
obs = reset(env);
plot(env)
for stepCt = 1:maxStepsPerEpisode
% Select action according to trained policy
action = getAction(Actor,{obs});
% Step the environment
[nextObs,reward,isdone] = step(env,action{1});
% Check for terminal condition
if isdone
break
end
obs = nextObs;
end
%Local functions
function [Advantage, TDTarget] = computeGeneralizedAdvantage(StateValueEstimator, DiscountFactor, GAEFactor, Observation, nextObservation, rewardBatch,isdonebatch)
% Vectorized generalized advantage estimator (GAE)
% REVISIT: current implementation supports single episode
%BatchExperience = getBatchExperience(obj,hasState(StateValueEstimator));
% Unpack experience
% Observation = BatchExperience{1};
% Reward = BatchExperience{3};
% NextObservation = BatchExperience{4};
% IsDone = BatchExperience{5};
SequenceLength = numel(rewardBatch);
% Estimate current and next state values
CurrentStateValue = getValue(StateValueEstimator, Observation);
NextStateValue = getValue(StateValueEstimator, nextObservation);
NextStateValue(isdonebatch == 1) = 0; % early termination
% Vectorized GAE Advantages
% TDError = [TDError(1) TDError(2) ... TDError(4)]
TDError = rewardBatch + ...
reshape(DiscountFactor * NextStateValue - CurrentStateValue, size(rewardBatch));
if GAEFactor == 0
% If GAEFactor == 0, similar to 1 step look ahead (or TD0)
Advantage = TDError;
else
% Adv(1) = TDError(1) + A*TDError(2) + A^2*TDError(3) + A^3*TDError(4)
% Adv(2) = TDError(2) + A^1*TDError(3) + A^2*TDError(4)
% Adv(3) = TDError(3) + A^1*TDError(4)
% Adv(4) = TDError(4)
% ...
% Adv = [TDError(1) TDError(2) ... TDError(4)] * [ 1 0 0 0
% A^1 1 0 0
% A^2 A^1 1 0
% A^3 A^2 A^1 1]
% Adv = TDError * DiscountWeights
WeightsMatrix = repmat((0:SequenceLength-1)',1,SequenceLength) - (0:SequenceLength-1);
%WeightsMatrix =
% [0 -1 -2 -3 -4
% 1 0 -1 -2 -3
% 2 1 0 -1 -2
% 3 2 1 0 -1
% 4 3 2 1 0]
DiscountWeights = tril((DiscountFactor*GAEFactor) .^ WeightsMatrix);
% With A = DiscountFactor*GAELambda, DiscountWeights =
% [ 1 0 0 0
% A^1 1 0 0
% A^2 A^1 1 0
% A^3 A^2 A^1 1]
Advantage = TDError(:)' * DiscountWeights;
end
% Temporal difference target = Advantage[s] + V[s]
Advantage = reshape(Advantage, size(CurrentStateValue));
TDTarget = Advantage + CurrentStateValue;
end
%Loss Function
function Loss = actorLossFunction(MeanAndStd, LossVariable)
% Clipped PPO loss with entropy regularization for a continuous action space
% MeanAndStd: dlarray of current policy action probabilities (model output)
% LossVariable: struct contains
% - SamplingStrategy
% - Action: previous action
% - OldPolicy: old action policy piOld(at|st)
% - Advantage
% - ClipFactor: scalar > 0
% - EntropyLossWeight: scalar where 0 <= EntropyLossWeight <= 1
% Copyright 2019 The MathWorks Inc.
% Extract information from input
Advantage = LossVariable.Advantage;
OldPolicy = LossVariable.OldPolicy;
NumExperience = numel(Advantage);
% compute pi(at|st)
Policy = evaluate(LossVariable.SamplingStrategy, MeanAndStd, LossVariable.Action);
% rt = pi(at|st)/piOld(at|st), avoid division by zero
Ratio = Policy ./ rl.internal.dataTransformation.boundAwayFromZero(OldPolicy);
% obj = rt * At
Advantage = reshape(Advantage, 1, NumExperience);
Objective = Ratio .* Advantage;
ObjectiveClip = max(min(Ratio, 1 + LossVariable.ClipFactor), 1 - LossVariable.ClipFactor) .* Advantage;
% clipped surrogate loss
SurrogateLoss = -sum(min(Objective, ObjectiveClip),'all')/NumExperience;
% entropy loss
EntropyLoss = rl.loss.policyEntropyContinuous(MeanAndStd, ...
LossVariable.EntropyLossWeight,NumExperience);
% total loss
Loss = SurrogateLoss + EntropyLoss;
end
function [trainingPlot, lineReward, lineAveReward] = hBuildFigure()
plotRatio = 16/9;
trainingPlot = figure(...
'Visible','off',...
'HandleVisibility','off', ...
'NumberTitle','off',...
'Name','Cart Pole Custom Training');
trainingPlot.Position(3) = plotRatio * trainingPlot.Position(4);
ax = gca(trainingPlot);
lineReward = animatedline(ax);
lineAveReward = animatedline(ax,'Color','r','LineWidth',3);
xlabel(ax,'Episode');
ylabel(ax,'Reward');
legend(ax,'Cumulative Reward','Average Reward','Location','northwest')
title(ax,'Training Progress');
end
1 Comment
jiayi on 25 Apr 2023
What is the Actor.SamplingStrategy and how was it obtained?


Answers (2)

Anh Tran on 8 Dec 2020
The approach looks OK; however, there is an issue. You must update the agent's actor and critic after each learning iteration, so set them back on the agent before each call to sim:
for episodeCt = 1:numEpisodes
% update actor, critic
agent = setActor(agent,Actor);
agent = setCritic(agent,critic);
% sim
simout = sim(agent, env);
...
end
Instead of a custom training loop, you can write a custom agent (a subclass) that works with a Simulink environment; there is a documentation example showing how to convert a custom training loop into a custom agent. The benefits (see the minimal sketch after this list):
  • You don't recompile the Simulink environment on every run (which is what your current approach does)
  • You can use the train() function and get the episode reward reporting by default
  • You can also set a breakpoint inside your custom agent during training
  • With your approach, you only update after an episode (or several episodes) has finished. That is not the case with a custom agent, where you can call learn() whenever you want
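For illustration only, here is a minimal sketch of what such a custom agent could look like. It assumes the rl.agent.CustomAgent base class with its getActionImpl, getActionWithExplorationImpl, and learnImpl methods (the pattern used in the custom agent documentation example) and reuses the actor, critic, and rlPPOAgentOptions objects from the question; the class name CustomPPOAgent, the experience-buffering logic, and the placement of the PPO update are hypothetical placeholders, not a tested implementation:
classdef CustomPPOAgent < rl.agent.CustomAgent
    properties
        Actor   % rlStochasticActorRepresentation from the question
        Critic  % rlValueRepresentation from the question
        Options % rlPPOAgentOptions from the question
    end
    properties (Access = private)
        ExperienceBuffer = {} % experiences collected since the last update
    end
    methods
        function obj = CustomPPOAgent(actor,critic,obsInfo,actInfo,options)
            obj = obj@rl.agent.CustomAgent();
            obj.ObservationInfo = obsInfo;
            obj.ActionInfo = actInfo;
            obj.SampleTime = options.SampleTime; % a sample time is needed for Simulink environments
            obj.Actor = actor;
            obj.Critic = critic;
            obj.Options = options;
        end
    end
    methods (Access = protected)
        function action = getActionImpl(obj,observation)
            % action used for evaluation/deployment (here: simply sample the policy)
            action = getAction(obj.Actor,observation);
        end
        function action = getActionWithExplorationImpl(obj,observation)
            % action used while training
            action = getAction(obj.Actor,observation);
        end
        function action = learnImpl(obj,experience)
            % experience = {observation, action, reward, nextObservation, isDone}
            obj.ExperienceBuffer{end+1} = experience;
            if numel(obj.ExperienceBuffer) >= obj.Options.ExperienceHorizon || experience{5}
                % Run the same GAE + clipped-surrogate update as in the question
                % (computeGeneralizedAdvantage, actorLossFunction) on the buffered
                % experiences, updating obj.Actor and obj.Critic, then clear the buffer.
                obj.ExperienceBuffer = {};
            end
            % return the action to apply at the next step
            action = getActionWithExplorationImpl(obj,experience{4});
        end
    end
end
Such an agent could then be constructed as agent = CustomPPOAgent(Actor,critic,obsInfo,actInfo,opt) and passed to train(agent,env,trainOpts) like a built-in agent.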

jiayi on 17 Apr 2023
What does this line of code mean?
Actor = setLoss(Actor, @actorLossFunction);
