DDPG training converges to the worst results obtained during exploration

Question

Alessandro Fasiello, 24 January 2024

Direct link to this question:

https://jp.mathworks.com/matlabcentral/answers/2073681-ddpg-training-converges-to-the-worst-results-obtained-during-exploration

Commented: Alessandro Fasiello, 3 February 2024

I'm using MATLAB R2020b because it is the version installed on the department's HPC system.

The DDPG agent that I defined behaves well during the early stages of training, steadily improving the reward it obtains. The automatically saved agents that obtained the highest rewards show that the system is indeed learning a good policy and moving in the right direction.

Despite this encouraging start, the reward-per-episode plot suddenly collapses to a very low value, showing that the learning has converged to one of the worst policies explored (as seen in the pictures).

In the second training run shown, I halved the learning rates of both the actor and the critic (they were initially set to 0.0001 and 0.001).

The optimizer is 'Adam', with GradientThreshold = 0.1, actorL2RegularizationFactor = 1e-5, criticL2RegularizationFactor = 2e-4.
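For readers who want to reproduce these settings, the following is a minimal sketch of how they might be expressed with rlRepresentationOptions in R2020b. The variable names actorOpts and criticOpts are illustrative, and the sketch assumes that 0.0001 is the actor's learning rate, that 0.001 is the critic's, and that the same gradient threshold applies to both; the post does not state this explicitly.

```matlab
% Sketch only (requires Reinforcement Learning Toolbox, R2020b API).
% Assumptions: 1e-4 is the actor's initial learning rate, 1e-3 the
% critic's, and GradientThreshold = 0.1 is shared by both.
actorOpts = rlRepresentationOptions( ...
    'Optimizer', 'adam', ...
    'LearnRate', 1e-4, ...
    'GradientThreshold', 0.1, ...
    'L2RegularizationFactor', 1e-5);

criticOpts = rlRepresentationOptions( ...
    'Optimizer', 'adam', ...
    'LearnRate', 1e-3, ...
    'GradientThreshold', 0.1, ...
    'L2RegularizationFactor', 2e-4);
```

For the second run, both LearnRate values would be halved.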

The actor network is built as follows:

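actorPath = [
    featureInputLayer(obsInfo.Dimension(1), 'Name', 'obsInLyr')
    fullyConnectedLayer(300, 'Name', 'fc1_300')
    reluLayer('Name', 'relu1')
    fullyConnectedLayer(600, 'Name', 'fc2_600')
    fullyConnectedLayer(600, 'Name', 'fc3_600')
    reluLayer('Name', 'relu3')   % layer names must be unique
    fullyConnectedLayer(actInfo.Dimension(1), 'Name', 'fc4_2')
    tanhLayer('Name', 'tanh')    % output in [-1, 1]
    % Map [-1, 1] to [-anMax, apMax] and [-maxSteering, maxSteering],
    % assuming anMax and apMax are positive magnitudes. The bias is the
    % midpoint of the range, which is 0 when anMax equals apMax.
    scalingLayer('Name', 'actionOutLyr', ...
        'Scale', [(anMax+apMax)/2; maxSteering], ...
        'Bias',  [(apMax-anMax)/2; 0])
    ];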

The critic network is built as follows:

% Observation path
obsPath = [featureInputLayer(obsInfo.Dimension(1), 'Name', 'obsInLyr')
    fullyConnectedLayer(600,'Name','fc1_600')
    ];
% Action path
actPath = [featureInputLayer(actInfo.Dimension(1), 'Name', 'actInLyr')
    fullyConnectedLayer(600,'Name','fc2_600')
    ];
% Common path: concatenate the two 600-unit branches, then output Q(s,a)
commPath = [concatenationLayer(1,2,'Name','conc')
    reluLayer('Name','relu_1200')
    fullyConnectedLayer(600,'Name','fc3_600')
    reluLayer('Name','relu_600')
    fullyConnectedLayer(1, 'Name', 'QValue' )
    ];
% Assemble the graph and connect both branches to the concatenation layer
criticNetwork = layerGraph();
criticNetwork = addLayers(criticNetwork,obsPath);
criticNetwork = addLayers(criticNetwork,actPath);
criticNetwork = addLayers(criticNetwork,commPath);
criticNetwork = connectLayers(criticNetwork,'fc1_600','conc/in1');
criticNetwork = connectLayers(criticNetwork,'fc2_600','conc/in2');

Other AgentOptions are:

ulisseAgent.AgentOptions.SampleTime = 0.05;
ulisseAgent.AgentOptions.DiscountFactor = 0.99;
ulisseAgent.AgentOptions.MiniBatchSize = 64;
ulisseAgent.AgentOptions.ExperienceBufferLength = 1e6;
ulisseAgent.AgentOptions.TargetSmoothFactor = 1e-3;
ulisseAgent.AgentOptions.NoiseOptions.MeanAttractionConstant = 0.15;
ulisseAgent.AgentOptions.NoiseOptions.Variance = [0.3;5];
ulisseAgent.AgentOptions.NoiseOptions.Mean = [0;0];
ulisseAgent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-5;

The action space is two-dimensional: the first action is limited to the range -3 to 3, and the second to the range -40 to 40.
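As a quick check of how a tanh output can be mapped onto these limits, here is a small sketch in plain MATLAB. It mirrors what a scaling layer computes (Scale .* input + Bias); the names lb, ub and toAction are illustrative and not from the original script.

```matlab
% Sketch: map a tanh output in [-1, 1] onto per-action bounds [lb, ub].
% lb and ub are the limits stated above; the variable names are illustrative.
lb = [-3; -40];            % lower action limits
ub = [ 3;  40];            % upper action limits

scale = (ub - lb) / 2;     % half-width of each range
bias  = (ub + lb) / 2;     % midpoint of each range

% A scaling layer computes scale .* x + bias element-wise.
toAction = @(x) scale .* x + bias;

aMin = toAction([-1; -1]); % action produced by the lowest tanh output
aMax = toAction([ 1;  1]); % action produced by the highest tanh output
```

With symmetric limits the bias is zero. For an asymmetric range such as [-anMax, apMax], the midpoint would be (apMax - anMax)/2.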

The reward is the sum of several terms, each evaluating a different aspect of the agent's behaviour. Each term is built as a Huber function so that its gradient is continuous in the neighbourhood of the optimum.
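To illustrate the idea, a single Huber-shaped reward term could look like the sketch below. This is not the actual reward function used in the training: delta, the error values and the sign convention are placeholders chosen for the example.

```matlab
% Hypothetical Huber-shaped penalty; delta and the error values are
% placeholders, not taken from the original reward function.
% Quadratic for |e| <= delta, linear beyond it, with matching value and
% slope at |e| = delta.
huber = @(e, delta) (abs(e) <= delta) .* (0.5 .* e.^2) + ...
                    (abs(e) >  delta) .* (delta .* (abs(e) - 0.5 .* delta));

delta = 1;                         % transition between quadratic and linear regions
e     = [-3 -1 -0.5 0 0.5 1 3];    % example tracking errors
rewardTerm = -huber(e, delta);     % one reward component, largest (zero) at e = 0
```

Because the term is quadratic around e = 0, its slope goes smoothly through zero at the optimum, unlike an absolute-value penalty.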


Answer 1

Emmanouil Tzorakoleftherakis, 24 January 2024

Direct link to this answer:

https://jp.mathworks.com/matlabcentral/answers/2073681-ddpg-training-converges-to-the-worst-results-obtained-during-exploration#answer_1396796

I cannot see your training options, but what do you mean by "converges"? The training plot only shows about 1800 episodes. In general, there is no guarantee that the average reward will increase monotonically throughout training. Exploration may move the training from a "good" point to a "bad" one, as apparently happens here. A few things to consider:

1) You still have decaying variance in your exploration. Towards the end of the training you are showing, the agent may not be able to get out of this low-reward region because it's exploring less. If it is allowed to explore, it is likely it will come back to a higher reward region in later episodes.

2) It's up to you really when you want to stop the agent. If you feel the agent you have around episode 700 is good enough, you may choose to stop training.
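To get a feel for how much exploration is left late in training, the decay can be estimated in plain MATLAB. The sketch below assumes the per-step rule described in the toolbox documentation for R2020b (the variance is multiplied by 1 - VarianceDecayRate at every sample step and floored at VarianceMin), uses the noise settings from the question, and takes the roughly 250,000 steps reported in the follow-up comment. VarianceMin is not given in the thread, so 0 is assumed.

```matlab
% Settings quoted in the question.
variance0   = [0.3; 5];    % NoiseOptions.Variance
decayRate   = 1e-5;        % NoiseOptions.VarianceDecayRate
varianceMin = [0; 0];      % assumed; VarianceMin is not given in the thread

nSteps = 250000;           % approximate step count reported for the second run

% Assumed per-step rule: variance <- max(variance .* (1 - decayRate), varianceMin)
decayFactor = (1 - decayRate) ^ nSteps;
varianceEnd = max(variance0 .* decayFactor, varianceMin);

% Number of steps for the variance to halve under the same rule.
halfLifeSteps = log(0.5) / log(1 - decayRate);
```

Under these assumptions the noise variance is down to roughly 8% of its initial value by the end of the run, which is consistent with the agent exploring much less at that point.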

5 comments

Alessandro Fasiello, 24 January 2024

Thank you for your reply!

By "converges" I mean that the agent settles on a strategy that always results in the same behaviour. In this case, the agent being an autonomous vehicle, it settles on leaving its lane as soon as possible, which triggers the "done" condition of the environment.

By the way, the training options are the following:

trainOptions = rlTrainingOptions();
trainOptions.MaxEpisodes = 4e3;
trainOptions.MaxStepsPerEpisode = 600;
trainOptions.ScoreAveragingWindowLength = 200;
trainOptions.StopTrainingCriteria = "EpisodeReward";
trainOptions.StopTrainingValue = 600*12;
trainOptions.SaveAgentCriteria = "EpisodeReward";
trainOptions.SaveAgentValue = 600*5;
trainOptions.SaveAgentDirectory = strcat("../../out/trainedAgents_",dateString);
trainOptions.Verbose = 1;

As for the number of steps, the second training run reached around 250,000 steps.

Moreover, the change in behaviour coincides with the sudden stabilization of the initial Q0 value (the yellow curve).

I noticed that the picture has no legend: the blue signal is the Episode Reward, the orange one is the Average Reward calculated over the last 200 episodes, and the yellow one is the initial Q0.

Alessandro Fasiello, 25 January 2024

Thank you, I will try and update you as soon as possible.

Alessandro Fasiello, 3 February 2024

Further experiments have shown the same behaviour even with a lower decay rate and with the minimum variance set, with little if any evident correlation with the number of steps or episodes.

Right now, a new training run is in progress with a much simpler reward function and, after 3e5 steps in 1800 episodes, it shows no sign of the problem discussed.
