RL PPO agent diverges with one-step training

Haochen on 17 Jun 2024
Answered: Shivansh on 27 Jun 2024
Hi,
I am training a PPO agent on a system with a continuous action space. I want the agent to train for only one episode of one step in each call to train(), and then see how it performs:
trainingOpts = rlTrainingOptions(...
    MaxEpisodes=1, ...
    MaxStepsPerEpisode=1, ...
    Verbose=false, ...
    Plots="none", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);
These are the agent settings:
function [agents, obsInfo, actionInfo] = generate_PPOagents(Ts)
    % Observation and action spaces
    obsInfo = rlNumericSpec([2 1], 'LowerLimit', -inf*ones(2,1), 'UpperLimit', inf*ones(2,1));
    obsInfo.Name = 'state';
    obsInfo.Description = 'position, velocity';
    actionInfo = rlNumericSpec([1 1], 'LowerLimit', -inf, 'UpperLimit', inf);
    actionInfo.Name = 'continuousAction';

    % PPO agent options
    agentOptions = rlPPOAgentOptions(...
        'DiscountFactor', 0.99, ...
        'EntropyLossWeight', 0.01, ...
        'ExperienceHorizon', 20, ...
        'MiniBatchSize', 20, ...
        'ClipFactor', 0.2, ...
        'NormalizedAdvantageMethod', 'none', ...
        'SampleTime', -1);

    agent1 = rlPPOAgent(obsInfo, actionInfo, agentOptions);
    agent2 = rlPPOAgent(obsInfo, actionInfo, agentOptions);
    agents = [agent1, agent2];
end
My reward is conditional, based on whether the states satisfy certain conditions:
function [nextObs, reward, isDone, loggedSignals] = myStepFunction1(action, loggedSignals, S)
    % Propagate the discrete-time linear dynamics
    nextObs = S.A1d*[loggedSignals.State(1); loggedSignals.State(2)] + S.B1d*action;
    loggedSignals.State = nextObs;

    % Penalize leaving the region |x| <= 10, otherwise apply a quadratic cost
    if abs(nextObs(1)) > 10 || abs(nextObs(2)) > 10
        reward = S.test - 100;
    else
        reward = -1*(nextObs(1)^2 + nextObs(2)^2);
    end
    isDone = false;
end
In this setup, every time train() finishes, I advance the agent one step using getAction(), then modify the reset function and update the environment so that the next call to train() starts the simulation from the new state, and then call train() again to carry on the loop. But when I simulate the system this way, the states diverge to Inf after only about 20 train() iterations. I have checked my environment and the agent settings, and everything seems fine. I also tested whether the issue comes from the penalty in the reward function by changing S.test above, but the simulation still fails.
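In simplified form, my outer loop looks roughly like this (single agent only; the reset/step wrappers, the 100-iteration count, and the initial state are placeholders for my actual code):
currentState = [0; 0];                 % placeholder initial state
for k = 1:100
    % Rebuild the environment so train() starts from the current state
    resetFcn = @() deal(currentState, struct('State', currentState));
    stepFcn  = @(action, logged) myStepFunction1(action, logged, S);
    env = rlFunctionEnv(obsInfo, actionInfo, stepFcn, resetFcn);

    % One-episode, one-step training with the options above
    trainingStats = train(agent1, env, trainingOpts);

    % Advance the system one step using the current policy
    action = getAction(agent1, {currentState});
    currentState = S.A1d*currentState + S.B1d*action{1};
end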
I am not sure whether the issue is caused by this one-episode, one-step training method. In theory I expect poor performance at first, but it should not diverge to Inf so quickly.
Thanks.

Accepted Answer

Shivansh on 27 Jun 2024
Hi Haochen,
It looks like you are facing numerical instability during training of your RL model.
It would be helpful if you could provide the training graph for this issue.
If you think the agent is producing excessively large actions that lead to divergence, you can try limiting the actions to a reasonable bound and saturating the outputs.
actionInfo = rlNumericSpec([1 1],'LowerLimit',-15,'UpperLimit',15); % a sample example
Since you are training for only one step per episode while both "ExperienceHorizon" and "MiniBatchSize" are set to 20, the agent might not be able to collect enough experiences to perform effective updates.
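For example (illustrative values only, and assuming you adjust these options before constructing the agent), you could either let each episode run long enough to fill the horizon, or shrink the horizon and mini-batch to match your very short episodes:
trainingOpts.MaxStepsPerEpisode = 20;   % let each episode fill the 20-step horizon
% ...or, alternatively, make the on-policy buffer smaller:
agentOptions.ExperienceHorizon = 4;
agentOptions.MiniBatchSize = 4;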
You can also try normalizing the observations and actions and analyzing the impact on training. You can add the normalization options in "agentOptions" and set them to true.
Analyzing the reward function is also a good way to find issues in RL training. You can also try adding gradient clipping and reducing the learning rate to avoid aggressive policy updates.
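For instance, here is a sketch of how you could set gradient clipping and the learning rates through the actor and critic optimizer options (the values below are illustrative, not tuned for your system):
actorOpts  = rlOptimizerOptions('LearnRate',1e-4,'GradientThreshold',1);
criticOpts = rlOptimizerOptions('LearnRate',1e-3,'GradientThreshold',1);
agentOptions = rlPPOAgentOptions(...
    'ActorOptimizerOptions', actorOpts, ...
    'CriticOptimizerOptions', criticOpts, ...
    'DiscountFactor', 0.99, ...
    'SampleTime', -1);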
You can refer to the MathWorks documentation on Proximal Policy Optimization (PPO) agents for more information.
I hope this helps!
