DDPG agent (used to set a temperature): 41% faster training time per episode with warm-up than without. Why?

Hi,
So I noticed something while training my DDPG Agent.
I use a DDPG Agent to set a temperature for a heating system depending on the weather forecast and other temperatures such as the outside temperature.
First I trained an agent without any warm-up, and then I trained a new agent with a warm-up of 700 episodes. It did what I had hoped: it converged faster and found a much better strategy than without the warm-up. I also noticed that the training time was much shorter. I calculated that training one episode takes 41% less time than it did without the warm-up.
Don't get me wrong, I really appreciate this, but I am trying to understand why.
I have not changed any of the agent options, just the warm-up.
If the agent were supposed to win a game as quickly as possible, I would understand it: thanks to the experience gained during the warm-up, the agent would find a better strategy sooner and win the game faster, so each episode would take less time. But in my case the agent should just set a temperature. There is no faster way to set a temperature.
Am I missing an important point?
I mean, in every training step and every episode the process is more or less the same. Set an action, get a reward, update the networks, update the policy and so on. Where in those steps could the 41% time improvement be?
Just to be clear: I understand why it converges faster; I just don't understand why the training time per episode is so much shorter. Without a warm-up, the average training time per episode was 28.1 seconds; with a warm-up it was 16.5 seconds.
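Roughly, my mental model of one training step looks like this (a simplified, generic off-policy sketch with placeholder function names, not the actual Toolbox internals):

action = applyPolicy(actor, obs) + explorationNoise();        % placeholder: actor output plus exploration noise
[nextObs, reward, isDone] = envStep(env, action);              % placeholder: simulate one step of the heating model
buffer = storeExperience(buffer, obs, action, reward, nextObs, isDone);
if numExperiences(buffer) >= miniBatchSize                     % learning only happens once enough samples exist
    batch  = sampleMiniBatch(buffer, miniBatchSize);           % 128 experiences per update in my case
    critic = updateCritic(critic, batch);                      % gradient step on the critic
    actor  = updateActor(actor, critic, batch);                % gradient step on the actor (policy)
    [targetActor, targetCritic] = softUpdate(targetActor, targetCritic, actor, critic, 1e-3);
end
obs = nextObs;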
These are my agent options, which I used for both agents:
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 128;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.NoiseOptions.Variance = 0.5;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
agent.AgentOptions.ResetExperienceBufferBeforeTraining = false;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
I also use the Reinforcement Learning Toolbox and normalised all my variables in both cases.
In general, everything works fine, but it drives me crazy that I can't understand why it's so much faster.
Maybe someone has an idea.

Accepted Answer

Venu on 13 Jan 2024
Based on the info you have provided, I can infer the following points:
  1. With warm-up experiences, the agent might be exploring the state and action space more efficiently.
  2. The learning rates for your critic and actor networks are set to allow for small updates. With a good initial experience buffer, the updates may be more stable and require fewer adjustments, leading to faster convergence and less time spent on each gradient update step.
  3. You mentioned that 'agentOptions.ResetExperienceBufferBeforeTraining' is set to 'false'. If the buffer is not reset, the agent with warm-up starts with a full buffer of experiences, which could lead to more efficient sampling and less time waiting for the buffer to fill up.
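To illustrate point 3, here is a minimal sketch of such a two-phase run ("env", "warmupOpts" and "trainOpts" are placeholders for your environment and rlTrainingOptions objects, not names from your post; the agent options are the ones you listed):

agent.AgentOptions.ResetExperienceBufferBeforeTraining = false;  % keep collected experiences between runs
warmupStats = train(agent, env, warmupOpts);   % e.g. 700 warm-up episodes that fill the buffer
mainStats   = train(agent, env, trainOpts);    % main run starts from an already well-filled buffer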
1 Comment
Milan B on 13 Jan 2024
@Venu thanks for the Answer.
Interesting aspects! Especially the second point about the "less time spent on each gradient update step". Does this mean that the gradients are updated more efficiently because of the better quality of the experiences drawn from the buffer? I am currently using the L2 norm as the gradient threshold method, with a GradientThreshold of 1. My understanding is that if the gradient updates are suboptimal due to insufficient experience, it is more likely that this threshold will be exceeded. Consequently, the gradient has to be clipped using the L2 norm, which is a time-consuming process.
Could this be a possible explanation? I mean, sure there are other factors, but this is what I thought when I heard faster gradient updates.
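Just so we are talking about the same mechanism, this is how I picture the L2-norm clipping with a GradientThreshold of 1 (a toy sketch with made-up numbers, not the Toolbox implementation):

g = [0.9; -1.2; 0.4];             % made-up gradient of one learnable parameter array
threshold = 1;
gNorm = norm(g);                  % global L2 norm, here about 1.55
if gNorm > threshold
    g = g * (threshold / gNorm);  % rescale so that norm(g) equals the threshold
end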
