Problems using a Reinforcement Learning Agent

15 views (last 30 days)
paolo dini on 20 Feb 2021
Commented: paolo dini on 24 Feb 2021
Hello, everyone,
I would like some support in using a DDPG agent with my electric drive model with a brushless motor.
Just to give some context, the model consists of the classical equations of the electromechanical model of a permanent magnet synchronous motor in three-phase axes, plus some subsystems implementing the Park and Blondel transformations (which basically map the three phases into a "comfortable" equivalent two-phase frame in which the electromagnetic torque takes a simple form); there is also an ideal inverter model implemented with state machines in Stateflow.
The model is validated: in fact, by applying FOC control with a classical cascaded-PI architecture I am able to solve the position trajectory-tracking problem.
Now I am trying an agent with an actor-critic architecture, in particular a DDPG agent.
Similarly to what is shown in the examples, I used the classical repeated fullyConnectedLayer + reluLayer structure for both the actor and the critic.
The reward function and the early-stop flag of the simulation are based on the errors between the physical quantities and the reference signals, while the observations are all the currents (both three-phase and transformed), the angular position and velocity, and the reference signals (the reference for one of the currents, which must always be zero, and the reference for the angular position).
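To give an idea of the structure, the agent definition follows the usual pattern from the toolbox examples; this is only a simplified sketch (layer sizes, signal counts and the sample time are placeholders, the exact values are in the attached files):

% Observations: three-phase and transformed currents, position, speed, references (9 here as an example)
obsInfo = rlNumericSpec([9 1]);
% Actions: the normalized voltage commands sent to the inverter model (2 here as an example)
actInfo = rlNumericSpec([2 1], 'LowerLimit', -1, 'UpperLimit', 1);

% Critic: observation and action paths merged by an addition layer
statePath = [featureInputLayer(9, 'Normalization', 'none', 'Name', 'obs')
             fullyConnectedLayer(64, 'Name', 'fc1')
             reluLayer('Name', 'relu1')
             fullyConnectedLayer(64, 'Name', 'fc2')];
actionPath = [featureInputLayer(2, 'Normalization', 'none', 'Name', 'act')
              fullyConnectedLayer(64, 'Name', 'fcA')];
commonPath = [additionLayer(2, 'Name', 'add')
              reluLayer('Name', 'relu2')
              fullyConnectedLayer(1, 'Name', 'QValue')];
criticNet = layerGraph(statePath);
criticNet = addLayers(criticNet, actionPath);
criticNet = addLayers(criticNet, commonPath);
criticNet = connectLayers(criticNet, 'fc2', 'add/in1');
criticNet = connectLayers(criticNet, 'fcA', 'add/in2');
critic = rlQValueRepresentation(criticNet, obsInfo, actInfo, ...
    'Observation', {'obs'}, 'Action', {'act'});

% Actor: fullyConnected + relu stack ending in tanh so the output stays in [-1, 1]
actorNet = [featureInputLayer(9, 'Normalization', 'none', 'Name', 'obs')
            fullyConnectedLayer(64, 'Name', 'afc1')
            reluLayer('Name', 'arelu1')
            fullyConnectedLayer(64, 'Name', 'afc2')
            reluLayer('Name', 'arelu2')
            fullyConnectedLayer(2, 'Name', 'afc3')
            tanhLayer('Name', 'tanh')];
actor = rlDeterministicActorRepresentation(actorNet, obsInfo, actInfo, ...
    'Observation', {'obs'}, 'Action', {'tanh'});

agent = rlDDPGAgent(actor, critic, rlDDPGAgentOptions('SampleTime', 2e-4));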
The problem I have is that, although this application is similar to the available MATLAB examples, where the agent is shown to learn to control the system in a "small" number of episodes, in my case the reward remains at very low values.
Hoping that someone can help me and suggest modifications, I am sharing the files that define the agent and the Simulink diagram with my electric drive model.
Thank you for your attention.
Pierpaolo.
  1 comment
paolo dini on 23 Feb 2021
Is no one able to suggest anything??


Answers (2)

Emmanouil Tzorakoleftherakis on 23 Feb 2021
Hello,
I am assuming you have already seen this example? It seems similar. I don't see the script where you set up the DDPG agent, but there could be a lot of things going on:
1) The example above has a lot of the inputs/outputs/quantities of interest in p.u. That makes training smoother, as things are already normalized. Not sure if that's the case here.
2) Along the same lines as 1), make sure the various terms in the reward signal are scaled properly. For example, in the reward signal you scale the sum of the id, iq, theta, and omega errors. Should that be the case, or should they be scaled separately? How do the id and iq errors compare to the theta error? If they are not scaled properly, you won't learn what you want (see the first sketch after this list).
3) Assuming the reward setup is correct, one DDPG parameter that is often overlooked but is very important is the noise options. This is also related to what the final layers of the actor look like, but basically, if the noise variance is not set properly, the agent won't be able to explore and will be stuck in the same local minimum. As the link suggests, make sure the noise variance value makes sense given what your action range looks like (see the second sketch after this list).
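For point 2, a minimal sketch of what "scaled separately" could look like in a MATLAB Function block computing the reward (the weights are purely illustrative and have to be tuned to the magnitudes of your signals):

function r = computeReward(e_id, e_iq, e_theta, e_omega)
% Illustrative weights only: pick them so each weighted term ends up with a
% comparable magnitude (current errors in A and angle errors in rad can
% otherwise differ by orders of magnitude).
w_id = 1; w_iq = 1; w_theta = 50; w_omega = 0.1;
r = -(w_id*e_id^2 + w_iq*e_iq^2 + w_theta*e_theta^2 + w_omega*e_omega^2);
end

For point 3, the noise settings live in rlDDPGAgentOptions; a sketch, assuming a sample time and a normalized action range that you would replace with your own values:

Ts = 2e-4;        % agent sample time (placeholder)
actRange = 2;     % e.g. an action normalized to [-1, 1]

agentOpts = rlDDPGAgentOptions('SampleTime', Ts);
% Rule of thumb from the documentation: Variance*sqrt(SampleTime) should be on
% the order of 1% to 10% of the action range, otherwise the agent barely explores.
agentOpts.NoiseOptions.Variance = 0.05 * actRange / sqrt(Ts);
agentOpts.NoiseOptions.VarianceDecayRate = 1e-5;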
Hope that helps

paolo dini on 23 Feb 2021
Edited: paolo dini on 23 Feb 2021
Thanks for responding!!!
Yes, this is the example I initially tried to reproduce, but using my electric drive model. Let's say it is the first example I studied.
As I said, I am basically trying to do the same thing but with DDPG instead of TD3. It seems really strange to me that the problem could come from switching from one type of agent to another.
I'll share the files with you again, since while I was waiting for help I did some more testing.
Let's say the goal remains the same: to make a FOC controller, that is, drive the current Id to zero and make the angular position theta follow the desired trajectory.
Doing some research, I am now trying a reward function made of three pieces: an exponential term that decays toward zero as the quadratic errors grow, and two quadratic forms, one for the control effort and one for the other measurable variables; roughly the sketch below.
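In MATLAB Function block form the idea is roughly the following (the gains k, a and b are placeholders that I am still tuning):

function r = computeReward(e_theta, e_id, u, x)
% e_theta, e_id: tracking errors; u: control inputs; x: other measured variables
k = 5; a = 0.01; b = 0.1;             % placeholder gains, still being tuned
r = exp(-k*(e_theta^2 + e_id^2)) ...  % close to 1 near the references, decays to zero as the errors grow
    - a*(u.'*u) ...                   % quadratic penalty on the control effort
    - b*(x.'*x);                      % quadratic penalty on the other measured variables
end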
I have also noticed that the training freezes after one episode. Can you give me an explanation for this phenomenon?
It seems strange to me that the agent could have learned something and that training would stop with such a low reward!
Thank you for your attention.
Pierpaolo.
  2 comments
Emmanouil Tzorakoleftherakis on 23 Feb 2021
Not sure why training freezes; I would need more info on that.
But the bottom line is that even if you try to solve the same problem, if you change the environment model you will likely need to retune various parameters such as the rewards (especially if these are not normalized).
As I mentioned above, if your actions are not normalized you will need to play with the noise options (again, even if it's a similar problem, these small details change things a lot).
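As a concrete illustration (the voltage limit below is a placeholder), one common pattern is to let the actor output stay in [-1, 1] with a tanh layer and rescale it to the physical range with a scalingLayer (or with a gain block in the Simulink model); the action range is then known, and the noise variance can be chosen relative to it:

Vmax = 48;   % placeholder voltage limit -- replace with your inverter limit

% Final layers of the actor: tanh bounds the raw output to [-1, 1] and the
% scaling layer maps it to [-Vmax, Vmax].
actorOutput = [fullyConnectedLayer(2, 'Name', 'fcOut')
               tanhLayer('Name', 'tanh')
               scalingLayer('Name', 'scale', 'Scale', Vmax)];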
paolo dini on 24 Feb 2021
Hello Emmanouil,
thank you again for your answer.
I will follow your suggestions.
Pierpaolo.

