Why can't the agent output the optimal solution during validation?

Question

Kun Cheng, 7 June 2023

Direct link to this question: https://jp.mathworks.com/matlabcentral/answers/1979864-why-can-not-output-optimal-solution-when-validate-agent

Answered: Shivansh, 14 September 2023

Hello everyone,

Topic: Reinforcement Learning, DQN Agent.

I trained an agent on my dataset (28 training samples in total) and then validated it on all of these samples. The problem is that I cannot get optimal results in validation: some results were good, but not all of them.

Environment: I created a custom environment.
I create the critic with this function: critic = rlVectorQValueFunction(nn,obsInfo,actInfo);
With the critic I create a DQN agent: agent = rlDQNAgent(critic);

I also tried a new agent with only 1 sample. Training converged, and validation gave the right answer for this sample. But when I trained an agent on all 28 samples using the same hyperparameters, correctness was not guaranteed. I don't know the reason. Is the dataset too small, or did I choose the wrong hyperparameters?

Agent hyperparameters:

agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 0.9;
agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 0.9;
agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.001;
agent.AgentOptions.DiscountFactor = 0.99;
agent.AgentOptions.MiniBatchSize = 128;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.0008;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.SaveExperienceBufferWithAgent = true;

Thank you

Kun

2 Comments

Emmanouil Tzorakoleftherakis, 13 June 2023

Are you using an IsDone signal? What do you mean by 28 training data? Do you mean 28 episodes? If that's the case, this number is really small. You need to at least give it a few hundred episodes to get an idea of how training progresses.

Kun Cheng, 14 June 2023 (edited 14 June 2023)

Hello,

I mean 28 training runs: 28 training samples in 1 epoch.

For example, I trained on one sample. Learning curve 1:

It converged at the right position, no problem. I did not change any hyperparameters and started the second run. Learning curve 2:

It converged to a suboptimal position.

This happens across all 28 training runs on the training data set: some of them converged properly, others converged to a suboptimal position.

My problem is that I do not know how to deal with this. Should I keep training on this data set (2, 3, or more epochs) until all of the samples converge to the right position? Or should I train the agent with a new training data set?

PS: I am pasting additional information about the hyperparameters. I am not sure whether there are problems there.

            agent.AgentOptions.EpsilonGreedyExploration.EpsilonDecay = 0.0001;
            agent.AgentOptions.EpsilonGreedyExploration.Epsilon = 0.9;
            agent.AgentOptions.EpsilonGreedyExploration.EpsilonMin = 0.0001;
            agent.AgentOptions.DiscountFactor = 0.99;
            agent.AgentOptions.MiniBatchSize = 128;
            agent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.0001;%0.0008
            agent.AgentOptions.CriticOptimizerOptions.L2RegularizationFactor = 2e-4;
            agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
            agent.AgentOptions.SaveExperienceBufferWithAgent=true;
            
            % neural network
            Layer_WB_5_2 = fullyConnectedLayer(128, 'Name', 'WB_5_2', 'WeightLearnRateFactor', 1, 'BiasLearnRateFactor', 1);
            nn = [
                 featureInputLayer(obsInfo.Dimension(1))
                 Layer_WB_5_2
                 reluLayer
                 fullyConnectedLayer(length(actInfo.Elements))
                 ];
            rng(0)
            nn = dlnetwork(nn);
            summary(nn)
            critic = rlVectorQValueFunction(nn,obsInfo,actInfo);

Thanks

Kun
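One thing that may be worth checking in the two hyperparameter sets posted above is how quickly exploration dies out. The sketch below assumes the per-step update described in the rlDQNAgentOptions documentation, Epsilon = Epsilon*(1-EpsilonDecay), applied while Epsilon is above EpsilonMin; if your release uses a different rule, the numbers will differ. The variable names are only for illustration.

```matlab
% Sketch: number of training steps until epsilon reaches EpsilonMin,
% assuming the update Epsilon = Epsilon*(1 - EpsilonDecay) at every step.
epsilon0 = 0.9;                  % initial Epsilon (same in both posts)
settings = [0.9    0.001;        % first post:  [EpsilonDecay, EpsilonMin]
            0.0001 0.0001];      % second post: [EpsilonDecay, EpsilonMin]

nSteps = zeros(size(settings,1), 1);
for k = 1:size(settings,1)
    decay  = settings(k,1);
    epsMin = settings(k,2);
    % Smallest n such that epsilon0*(1-decay)^n <= epsMin
    nSteps(k) = ceil(log(epsMin/epsilon0) / log(1 - decay));
    fprintf('EpsilonDecay = %g: epsilon reaches %g after about %d steps\n', ...
        decay, epsMin, nSteps(k));
end
```

Under this assumption, the first setting (EpsilonDecay = 0.9) stops exploring after only a few steps, while the second (EpsilonDecay = 0.0001) keeps exploring for on the order of 90,000 steps. Whether either schedule suits the problem depends on how many steps one training run actually takes.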


Answer 1

Shivansh, 14 September 2023

Direct link to this answer: https://jp.mathworks.com/matlabcentral/answers/1979864-why-can-not-output-optimal-solution-when-validate-agent#answer_1309871

Hi Kun,

I understand that you are training your agent with 28 training samples in one epoch, and that some of the training runs converge to the desired position while others converge to suboptimal positions. Here are some things to try:

Increase the number of epochs: Training for multiple epochs can help the agent learn more effectively and converge to better solutions. You can try training for more epochs (e.g., 2, 3, or more) and observe whether convergence improves for the suboptimal cases. Note that this can also lead to overfitting, since repeated passes over the same data can encourage memorization rather than generalization.
Collect more diverse training data: If possible, try to collect more diverse training data that covers a wider range of scenarios and states. This can help the agent learn a more robust policy by exposing it to a greater variety of situations.
Adjust the model architecture and hyperparameters: Experiment with the size of the critic network (for example, the number and width of the hidden layers) and with the agent hyperparameters, such as the learning rate, mini-batch size, discount factor, and epsilon-greedy exploration settings.
Monitor and analyse the learning process: During training, monitor the learning curves and observe the agent's behaviour. Look for patterns or anomalies that may indicate issues with convergence. Analyse the rewards, the loss, and the exploration-exploitation trade-off to gain insight into the learning process and identify potential areas for improvement.
Lastly, if DQN still struggles to converge, you can explore other algorithms such as Proximal Policy Optimization, Trust Region Policy Optimization, or Soft Actor-Critic.
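As a rough illustration of the first and fourth suggestions (longer training, then inspecting the results), here is a minimal sketch using rlTrainingOptions, train, rlSimulationOptions, and sim from Reinforcement Learning Toolbox. The function name trainThenValidate and the arguments maxEpisodes and maxSteps are hypothetical, and agent and env stand for the DQN agent and the custom environment from the question. The UseExplorationPolicy property is assumed to be available, as it is in recent releases.

```matlab
function [trainStats, totalReward] = trainThenValidate(agent, env, maxEpisodes, maxSteps)
% Sketch only: train for a fixed number of episodes, then evaluate the
% agent once with exploration switched off. maxEpisodes and maxSteps are
% illustrative arguments; they are not settings taken from the thread.

trainOpts = rlTrainingOptions( ...
    MaxEpisodes=maxEpisodes, ...            % e.g. a few hundred episodes
    MaxStepsPerEpisode=maxSteps, ...        % upper limit on episode length
    StopTrainingCriteria="EpisodeCount", ...% stop only when all episodes ran
    StopTrainingValue=maxEpisodes, ...
    Plots="none");                          % "training-progress" shows the curve

% The agent is a handle object, so train updates it in place.
trainStats = train(agent, env, trainOpts);

% Validate with the greedy policy (no epsilon-greedy exploration).
agent.UseExplorationPolicy = false;
simOpts     = rlSimulationOptions(MaxSteps=maxSteps);
experience  = sim(env, agent, simOpts);
totalReward = sum(experience.Reward.Data);  % Reward is logged as a timeseries
end
```

A call such as [stats, R] = trainThenValidate(agent, env, 500, 200) would then return the training statistics (for example stats.EpisodeReward, which can be plotted to compare learning curves between runs) and the total reward of one greedy validation episode.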

Reinforcement Learning is an iterative process and often requires fine tuning and experimentation for optimal results. It is crucial to analyse the data well to find the optimal solution.

Hope it helps!
