DDPG agent (used to set a temperature): 41% faster training time per episode with warm-up than without. Why?

Hi,
So I noticed something while training my DDPG Agent.
I use a DDPG Agent to set a temperature for a heating system depending on the weather forecast and other temperatures such as the outside temperature.
First I trained an agent without any warm-up, and then I trained a new agent with a warm-up of 700 episodes. It did what I had hoped: it converged faster and found a much better strategy than without the warm-up. I also noticed that the training time was much shorter. I calculated that training one episode takes 41% less time than it did without the warm-up.
Don't get me wrong, I really appreciate this, but I am trying to understand why.
I have not changed any of the agent options, just the warm-up.
If the agent were supposed to win a game as quickly as possible, I would understand it: thanks to the experience gained during the warm-up, the agent would find a better strategy sooner and win the game faster, so each episode would take less time. But in my case the agent should just set a temperature. There is no faster way to set a temperature.
Am I missing an important point?
I mean, in every training step and every episode the process is more or less the same. Set an action, get a reward, update the networks, update the policy and so on. Where in those steps could the 41% time improvement be?
Just to be clear: I understand why it converges faster; I just don't understand why the training time per episode is so much shorter. Without a warm-up, the average training time per episode was 28.1 seconds; with a warm-up it was 16.5 seconds.
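Roughly, my mental model of one training step looks like this (a simplified, generic off-policy sketch with placeholder function names, not the actual Toolbox internals):

action = applyPolicy(actor, obs) + explorationNoise();        % placeholder: actor output plus exploration noise
[nextObs, reward, isDone] = envStep(env, action);              % placeholder: simulate one step of the heating model
buffer = storeExperience(buffer, obs, action, reward, nextObs, isDone);
if numExperiences(buffer) >= miniBatchSize                     % learning only happens once enough samples exist
    batch  = sampleMiniBatch(buffer, miniBatchSize);           % 128 experiences per update in my case
    critic = updateCritic(critic, batch);                      % gradient step on the critic
    actor  = updateActor(actor, critic, batch);                % gradient step on the actor (policy)
    [targetActor, targetCritic] = softUpdate(targetActor, targetCritic, actor, critic, 1e-3);
end
obs = nextObs;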
These are my agent options, which I used for both agents:
agent.AgentOptions.TargetSmoothFactor = 1e-3;
agent.AgentOptions.DiscountFactor = 1.0;
agent.AgentOptions.MiniBatchSize = 128;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.NoiseOptions.Variance = 0.5;
agent.AgentOptions.NoiseOptions.VarianceDecayRate = 1e-6;
agent.AgentOptions.ResetExperienceBufferBeforeTraining = false;
agent.AgentOptions.CriticOptimizerOptions.LearnRate = 1e-03;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-04;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;
I also use the Reinforcement Learning Toolbox and normalised all my variables in both cases.
In general, everything works fine, but it drives me crazy that I can't understand why it's so much faster.
Maybe someone has an idea.

Accepted Answer

Venu on 13 Jan 2024
Based on the info you have provided, I can infer the following points:
  1. With warm-up experiences, the agent might be exploring the state and action space more efficiently.
  2. The learning rates for your critic and actor networks are set to allow for small updates. With a good initial experience buffer, the updates may be more stable and require fewer adjustments, leading to faster convergence and less time spent on each gradient update step.
  3. You mentioned that 'agentOptions.ResetExperienceBufferBeforeTraining' is set to 'false'. If the buffer is not reset, the agent with warm-up starts with a full buffer of experiences, which could lead to more efficient sampling and less time waiting for the buffer to fill up.
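To illustrate point 3, here is a minimal sketch of such a two-phase run ("env", "warmupOpts" and "trainOpts" are placeholders for your environment and rlTrainingOptions objects, not names from your post; the agent options are the ones you listed):

agent.AgentOptions.ResetExperienceBufferBeforeTraining = false;  % keep collected experiences between runs
warmupStats = train(agent, env, warmupOpts);   % e.g. 700 warm-up episodes that fill the buffer
mainStats   = train(agent, env, trainOpts);    % main run starts from an already well-filled buffer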
1 Comment
Milan B on 13 Jan 2024
@Venu thanks for the Answer.
Interesting aspects! Especially the second point about the "less time spent on each gradient update step". Does this mean that the gradients are updated more efficiently because of the better quality of the experiences drawn from the buffer? I am currently using the L2 norm as the gradient threshold method, with a GradientThreshold of 1. My understanding is that if the gradient updates are suboptimal due to insufficient experience, it is more likely that this threshold will be exceeded. Consequently, the gradient has to be clipped using the L2 norm, which is a time-consuming process.
Could this be a possible explanation? I mean, sure there are other factors, but this is what I thought when I heard faster gradient updates.
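Just so we are talking about the same mechanism, this is how I picture the L2-norm clipping with a GradientThreshold of 1 (a toy sketch with made-up numbers, not the Toolbox implementation):

g = [0.9; -1.2; 0.4];             % made-up gradient of one learnable parameter array
threshold = 1;
gNorm = norm(g);                  % global L2 norm, here about 1.55
if gNorm > threshold
    g = g * (threshold / gNorm);  % rescale so that norm(g) equals the threshold
end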
