Why did my AC agent converge to the minimal reward?
Hello everyone!
I trained an AC agent, but it converged to a policy that gives the minimal reward. I am not sure whether the problem lies in the neural network or in the environment. The rewards are negative because I want to find the minimum of a volume. I changed two parameters, the learning rate (0.05) and the entropy loss weight (0.01); all other parameters are at their defaults. I do not know which parameters deserve particular attention.
When I lowered the learning rate to 0.0005, training did not converge at all.
Here are the actor and the critic:
I want the actor to output values in [0, 1].
%% neural networks
% Critic: maps the observation vector to a scalar state-value estimate
nnc = [
    featureInputLayer(prod(obsInfo.Dimension), 'Name', 'input_c')
    fullyConnectedLayer(Knoten, 'Name', 'fc_c1')   % Knoten = number of hidden units
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(Knoten, 'Name', 'fc_c2')
    reluLayer('Name', 'relu2')
    fullyConnectedLayer(1, 'Name', 'output')];
nnc = dlnetwork(nnc);
critic = rlValueFunction(nnc, obsInfo);
% getValue(critic,{rand(obsInfo.Dimension)})
% Actor: a shared input trunk feeding two output heads (mean and standard deviation)
input_actor = [
    featureInputLayer( ...
        prod(obsInfo.Dimension), ...
        Name="input_a")
    fullyConnectedLayer( ...
        prod(actInfo.Dimension), ...
        Name="in_fc")
    ];
% Mean head: tanh -> fully connected -> sigmoid keeps the action mean in [0, 1]
nna1 = [
    tanhLayer(Name="tanhMean")
    fullyConnectedLayer(prod(actInfo.Dimension), Name="fc_mean")
    sigmoidLayer(Name="output_mean")
    ];
% Standard-deviation head: softplus keeps the standard deviation positive
nna2 = [
    tanhLayer(Name="tanhStdv")
    fullyConnectedLayer(prod(actInfo.Dimension), Name="fc_div")
    softplusLayer(Name="output_div")
    ];
nna = layerGraph(input_actor);
nna = addLayers(nna,nna1);
nna = addLayers(nna,nna2);
nna = connectLayers(nna,"in_fc","tanhMean/in");
nna = connectLayers(nna,"in_fc","tanhStdv/in");
% plot(nna)
nna = dlnetwork(nna);
% summary(nna)
actor = rlContinuousGaussianActor(nna, obsInfo, actInfo, ...
ActionMeanOutputNames="output_mean",...
ActionStandardDeviationOutputNames="output_div",...
ObservationInputNames="input_a");
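The agent construction itself is not included above. The following is only a sketch of how the learning rate and entropy loss weight mentioned earlier might be set, using rlACAgentOptions and rlOptimizerOptions from the Reinforcement Learning Toolbox, with all other options left at their defaults:
% Sketch only: the actual agent construction is not part of the original post
actorOpts  = rlOptimizerOptions(LearnRate=0.05);   % learning rate mentioned above
criticOpts = rlOptimizerOptions(LearnRate=0.05);
agentOpts  = rlACAgentOptions( ...
    EntropyLossWeight=0.01, ...                    % entropy loss weight mentioned above
    ActorOptimizerOptions=actorOpts, ...
    CriticOptimizerOptions=criticOpts);
agent = rlACAgent(actor, critic, agentOpts);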
Here is the step function of the environment:
function [nextobs,reward,isdone,loggedSignals] = step(this,action)
% unpack actions
this.Robot.x = action(1);
this.Robot.y = action(2);
this.Robot.NumOfHeight = action(3);
this.Robot.NumOfAngle = action(4);
[~, ~, ~, this.Volumennew, ~]=PrunnedTreeGenerator(this.Robot.x, this.Robot.y, this.Robot.NumOfHeight, this.Robot.NumOfAngle, 3,...
0.8, this.H, this.Bin_In_Training, this.RB, 0.5, 1);
% Keep track of the smallest volume found so far
if this.Volumennew<=min(this.volume_tree_Collection)
this.volume_tree = this.Volumennew;
%disp(this.volume_tree)
this.volume_tree_Collection = [this.volume_tree_Collection;...
this.volume_tree];
end
reward = -this.volume_tree/(0.5^2*pi*sum(this.Bin_In_Training(:, 3)));
% Alternative termination criterion (disabled): end the episode as soon as the
% new volume is not smaller than the current minimum, so a negative reward is given
%isdone = this.Volumennew>=min(this.volume_tree_Collection);
Distance = distanceCalculator(this, this.Robot.x, this.Robot.y);
Mean = meanXYZ(this);
Sigma_Square=getSigma(this);
DivisionSize=getSizeDivision(this);
isdone = this.l>=24;
if ~isdone
this.l=this.l+1;
nextobs = [this.Robot.x, this.Robot.y, this.Robot.NumOfHeight, this.Robot.NumOfAngle, this.volume_tree, size(this.Bin_In_Training, 1), Distance, Mean, Sigma_Square, DivisionSize]';
%reward = sum(1+this.l)/this.l;
this.State = [this.Robot.x, this.Robot.y, this.Robot.NumOfHeight, this.Robot.NumOfAngle, this.volume_tree, size(this.Bin_In_Training, 1), Distance, Mean, Sigma_Square, DivisionSize]';
% if isdone is false, a minimum has been found,
% therefore a positive reward is given
else
this.l=this.l+1;
%disp(this.State)
%disp(this.volume_tree)
nextobs = [this.Robot.x, this.Robot.y, this.Robot.NumOfHeight, this.Robot.NumOfAngle, this.volume_tree, size(this.Bin_In_Training, 1), Distance, Mean, Sigma_Square, DivisionSize]';
this.StepState = [this.StepState;this.k this.l nextobs'];
this.k=this.k+1;
%reward = ;
end
this.State = nextobs;
this.IsDone = isdone;
loggedSignals = nextobs;
end
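The training call is likewise not shown. A minimal sketch using rlTrainingOptions might look like the following, where env stands for the custom environment object and the episode limit of 24 matches the this.l >= 24 check in step; the other values are placeholders:
% Sketch only: the training setup is not part of the original post
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=1000, ...            % placeholder value
    MaxStepsPerEpisode=24, ...       % matches the this.l >= 24 limit in step()
    Plots="training-progress");
trainingStats = train(agent, env, trainOpts);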
I hope someone can help!
Thanks!
Kun
Accepted Answer
Sugandhi
18 October 2023
Hi,
I understand that the agent converges to a policy that gives minimal reward because of the way the reward is calculated in the ‘step’ function of the environment. The reward is based on the volume of a tree and is always negative, which means the agent is incentivized to minimize the volume.
To understand why the agent is converging to a policy that gives minimal rewards, you need to examine the reward function and the environment dynamics.
reward = -this.volume_tree/(0.5^2*pi*sum(this.Bin_In_Training(:, 3)));
It seems that the goal of your task is to find the minimal volume. However, negative rewards can make convergence challenging, especially if the agent is trained using gradient-based methods.
A few possible workarounds:
- Reward scaling: instead of using negative rewards, consider scaling the rewards to a positive range that aligns with the agent's objective (a sketch is shown below).
- Exploration: make sure the agent explores sufficiently during training. Exploration lets the agent try different actions and states, which can help it find better policies.
Reinforcement learning training can be sensitive to various factors, and it often requires experimentation and iterative adjustments to achieve desirable results.
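To illustrate the reward-scaling idea, here is a rough sketch (not a tested implementation) of how the reward line in your step function could be rewritten. It assumes the ratio of volume_tree to the bin volume lies in (0, 1], which you would need to verify for your environment:
% Sketch of a positive-range reward; assumes the volume ratio is in (0, 1]
binVolume   = 0.5^2*pi*sum(this.Bin_In_Training(:, 3));  % same normalization term as in your code
volumeRatio = this.volume_tree/binVolume;
reward      = 1 - volumeRatio;   % positive, and largest when the tree volume is smallest
With a reward in this form, maximizing the return still corresponds to minimizing the volume, but the values stay in a positive range.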