RL DDPG agent not converging

46 views (last 30 days)
Haochen on 17 November 2024, 18:36
Hi,
I am training a DDPG agent to control a single cart that starts with an initial speed and moves along a horizontal axis. The RL agent acts as a controller that applies a force along the axis to drive the cart back to the origin. It should not be a difficult task; however, after training for many steps, the control performance is still far from optimal.
These are my configurations for the agent and the environment. The optimal policy should drive the force to zero, meaning the cart should no longer be moving once it reaches the origin.
The agent uses an actor-critic architecture:
function [agents] = createDDPGAgents(N)
% Function to create two DDPG agents with the same observation and action info.
obsInfo = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1));
actInfo = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
% Define observation and action paths for critic
obsPath = featureInputLayer(prod(obsInfo.Dimension), Name="obsInLyr");
actPath = featureInputLayer(prod(actInfo.Dimension), Name="actInLyr");
% Define common path: concatenate along first dimension
commonPath = [
concatenationLayer(1, 2, Name="concat")
fullyConnectedLayer(30)
reluLayer
fullyConnectedLayer(1)
];
% Add paths to layerGraph network
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet, actPath);
criticNet = addLayers(criticNet, commonPath);
% Connect paths
criticNet = connectLayers(criticNet, "obsInLyr", "concat/in1");
criticNet = connectLayers(criticNet, "actInLyr", "concat/in2");
% Plot the network
plot(criticNet)
% Convert to dlnetwork object
criticNet = dlnetwork(criticNet);
% Display the number of weights
summary(criticNet)
% Create the critic approximator object
critic = rlQValueFunction(criticNet, obsInfo, actInfo, ...
ObservationInputNames="obsInLyr", ...
ActionInputNames="actInLyr");
% Check the critic with random observation and action inputs
getValue(critic, {rand(obsInfo.Dimension)}, {rand(actInfo.Dimension)})
% Create a network to be used as underlying actor approximator
actorNet = [
featureInputLayer(prod(obsInfo.Dimension))
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(30)
tanhLayer
fullyConnectedLayer(prod(actInfo.Dimension))
];
% Convert to dlnetwork object
actorNet = dlnetwork(actorNet);
% Display the number of weights
summary(actorNet)
% Create the actor
actor = rlContinuousDeterministicActor(actorNet, obsInfo, actInfo);
%% DDPG Agent Options
agentOptions = rlDDPGAgentOptions(...
'DiscountFactor', 0.98, ...
'MiniBatchSize', 128, ...
'TargetSmoothFactor', 1e-3, ...
'ExperienceBufferLength', 1e6, ...
'SampleTime', -1);
%% Create Two DDPG Agents
agent1 = rlDDPGAgent(actor, critic, agentOptions);
agent2 = rlDDPGAgent(actor, critic, agentOptions);
% Return agents as an array
agents = [agent1, agent2];
agentOptions.NoiseOptions.MeanAttractionConstant = 0.1;
agentOptions.NoiseOptions.StandardDeviation = 0.3;
agentOptions.NoiseOptions.StandardDeviationDecayRate = 8e-4;
agentOptions.NoiseOptions
end
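(For reference, not part of my original script: a quick sanity check that can be run on the returned agents, just to confirm the actions come out with the expected size.)
% Sanity-check sketch: query the freshly created agents with a random observation
agents = createDDPGAgents(1);      % N = 1, matching S.N used below
obs    = {rand(2,1)};              % random observation inside the [-100,100] spec bounds
getAction(agents(1), obs)          % returns the force as a 1x1 cell holding a scalar
getAction(agents(2), obs)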
The environment:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction(action, loggedSignals,S)
% Environment parameters
nextObs1 = S.A1d*loggedSignals.State + S.B1d*action(1);
nextObs = nextObs1;
loggedSignals.State = nextObs1;
if abs(loggedSignals.State(1))<=0.05 && abs(loggedSignals.State(2))<=0.05
reward1 = 10;
else
reward1 = -1*(1.01*(nextObs1(1))^2 + 1.01*nextObs1(2)^2 + action^2 );
if reward1 <= -1000
reward1 = -1000;
end
end
reward = reward1;
if abs(loggedSignals.State(1))<=0.02 && abs(loggedSignals.State(2))<=0.02
isDone = true;
else
isDone = false;
end
end
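For context, S.A1d and S.B1d are the discrete-time cart dynamics used in the state update above. A minimal sketch of how such matrices can be obtained for a double-integrator cart (the mass m and sample time Ts here are placeholders, not my actual values):
% Sketch: discretized double-integrator cart (placeholder parameters)
m    = 1;                                    % cart mass (assumed)
Ts   = 0.1;                                  % sample time (assumed)
sysc = ss([0 1; 0 0], [0; 1/m], eye(2), 0);  % state = [position; velocity], input = force
sysd = c2d(sysc, Ts);                        % zero-order-hold discretization
S.A1d = sysd.A;
S.B1d = sysd.B;
S.N   = 1;                                   % single cart, single force input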
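Also for reference: apart from the +10 bonus and the -1000 clipping, the reward is a quadratic cost on the state and the force, so on this linear model a discrete-time LQR gives roughly the behaviour an optimal policy should show. A baseline sketch (assumes Control System Toolbox; the initial state is just an example):
% Sketch: LQR baseline for the quadratic cost -(1.01*x1^2 + 1.01*x2^2 + u^2)
Q = 1.01*eye(2);                  % state weights taken from the reward
R = 1;                            % force weight taken from the reward
K = dlqr(S.A1d, S.B1d, Q, R);     % discrete-time LQR state-feedback gain
x = [1; 0];                       % example initial state (assumed)
for k = 1:200
    u = -K*x;                     % optimal linear state feedback
    x = S.A1d*x + S.B1d*u;        % same update as in myStepFunction
end
disp(x)                           % ends up essentially at the origin, with u -> 0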
And this is the simulation setup (I omitted the reset function here, and S.N = 1):
obsInfo1 = rlNumericSpec([2 1],'LowerLimit',-100*ones(2,1),'UpperLimit',100*ones(2,1)) ;
actInfo1 = rlNumericSpec([N 1],'LowerLimit',-100*ones(N,1),'UpperLimit',100*ones(N,1));
stepFn1 = @(action, loggedSignals) myStepFunction(action, loggedSignals, S);
resetFn1 = @() myResetFunction(pos1);
env = rlFunctionEnv(obsInfo1, actInfo1, stepFn1, resetFn1);
%% Specify agent initialization
agent= createDDPGAgents(S.N);
loggedSignals = [];
trainOpts = rlTrainingOptions(...
StopOnError="on",...
MaxEpisodes=1000,... %1100 for fully trained
MaxStepsPerEpisode=1000,...
StopTrainingCriteria="AverageReward",...
StopTrainingValue=480,...
Plots="training-progress");
%"training-progress"
train(agent, env, trainOpts);
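(Again not part of my original script, but for reference: the environment can be checked before training, and the trained agent can be rolled out afterwards to see where the cart actually settles. validateEnvironment, rlSimulationOptions, and sim are standard Reinforcement Learning Toolbox functions.)
% Sketch: validate the environment and simulate the trained agent
validateEnvironment(env);                  % runs reset/step once and checks signal dimensions
simOpts    = rlSimulationOptions(MaxSteps=1000);
experience = sim(env, agent, simOpts);     % roll out the trained agent in the environment
% experience.Observation holds the position/velocity trajectory; plot it to see where the cart settles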
This is the reward plot, where each episode takes a very long time, but there are still no signs of reaching the positive reward for this simple system.
And this is the control effect on both states, which shows that the RL agent is driving the cart to the wrong position, near -1, while its velocity is 0.
It is very weird that the reward does not converge to the positive reward but to another value. Can I ask where the problem could be? Thanks.
Haochen

Answers (0)

Release: R2023b
