Using Reinforcement Learning algorithm to optimize parameter(s) of a controller

Question

HazwanDrK 2020 年 7 月 15 日

1
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/565388-using-reinforcement-learning-algorithm-to-optimize-parameter-s-of-a-controller

編集済み: Mehrdad Moradi 2021 年 7 月 26 日

Hello,

First of all, I'm relatively new to reinforcement learning. I have a project in which I need to utilize RL to fine tune one, if not many parameters of an already well built controller, in my case it's a discrete controller. I have came across many papers describing the use of RL specifically for control but not many on optimization which I have trouble in understanding the concept. Perhaps, can someone shed some knowledge on the use of RL for parameters fine tuning, like how is it different than the RL controller concept and how is it going to run in parallel with the controller. I'm more than happy if you can share with me the references, if any. Thanks!

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

Emmanouil Tzorakoleftherakis 2020 年 7 月 16 日

1
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/565388-using-reinforcement-learning-algorithm-to-optimize-parameter-s-of-a-controller#answer_466562

Hi Hazwan,

The main difference between using RL for control vs parameter tuning is that in the first case the policy will directly output, e.g. torque, or whatever is your control input. In the latter case, the output of the policy would be parameter values, e.g., if you are trying to tune a PID, the policy would output 3 numbers, Kp, Ki and Kd. Obviously the observations/inputs to the policy as well as the reward would probably need to be different too.

To your question on how the latter could run in parallel with the controller, I can see two scenarios:

1) Using RL for finding static gains. In this case you train, you get the constant parameter values the RL policy finds, and then you discard the policy and adjust your controller gains with these numbers

2) Using RL for finding dynamic/observation-based parameters. This would be in some sense similar to gain scheduling and for this case you would run the policy in parallel with the controller. The idea would be the same(i.e. the policy would output parameter values) but it would do so all the time, thus updating the controller parameters dynamically based on observations.

Hope that helps.

8 件のコメント
6 件の古いコメントを表示6 件の古いコメントを非表示

HazwanDrK 2020 年 7 月 17 日

Hi Emmanouil,

Thank you for your response. Suppose I have a simple double integrator model running on a discrete (intermittent) control and I want to specifically tune the input threshold depending on the stochastic system disturbances. Since the controller already dictates the output based on the given parameters,

1) will my RL observation space still be the states, but at the overall time steps? How does this going to perturb my policy, say if I want to take the AC approach?

2) And how will I make use of the reward function? Will the reward be the summation of all the rewards at the overall time steps? How will it be different for both AC networks since I'd assume, there will be no exploration and exploitation of the space.

My apologies for the long dreaded enquiries, as I still cannot quite catch the idea of tuning the parameters of such system. I wish I can get in touch with experts like you more frequently though.

Regards.

Emmanouil Tzorakoleftherakis 2020 年 7 月 17 日

When you say "tune the input threshold", what do you mean? Is the controller generating continuous values but at discrete time steps?

To your questions:

1) The observation space should be set so that the agent has enough information to make an educated decision. System states are often good candidates for control-like problems, but you can also use state transformations, e.g. errors from references, integrals etc.I also don't understand what you mean by "How does this going to perturb my policy, say if I want to take the AC approach?". The policy perturbations in training are based on samples/experiences collected by the environment (including rewards). If you use actor-critic methods, you are perturbing both the actor/policy and the critic which guides the actor's actions.

2) The reward is the instantaneous benefit of being in a state and taking an action (so you get a reward at each step). This is what you need to specify. All other operations (like summing, discounting etc are handled by the agent implementations). The same is true for actor-critic methods. The exploration options you select in agent options decide how much the actor explores. Then the critic is using the received reward to update its weights, and the actor is using the critic's estimates to update its own weights.

HazwanDrK 2020 年 7 月 19 日

MATLAB Online で開く

Hi and thank you again for your response. Yes, you're right. The output is continuous but the control signal is at discrete time steps triggered upon the control input threshold; as in observe continuously, act intermittently.

1) As for observation in a simple double integrator model, may I assume the two states will be enough? Since I would take in the error from references to be my reward function.

%% Define environment
% create observation represented by states
obsInfo = rlNumericSpec([2 1],...
    'LowerLimit',-inf,...
    'UpperLimit',inf); 
obsInfo.Name = 'Double Integrator States';
obsInfo.Description = 'x, dx';
% create action represented by parameter (threshold)
actInfo = rlFiniteSetSpec([0.001 1]);
actInfo.Name = 'Double Integrator Action';

Will this be ok? I doubt my action finite value is true though since I wish to have an array of threshold ranging from 0.001 to 1.

2) As for policy, how do I define my actor and critic network? Since (assuming) my observation is just the states, is the action going to be my input threshold instead of the control input?

3) I partly understand the reward function. If I understood correctly, in the case of tuning the input threshold, both actor-critic network will take in all the instantaneous reward for the action at each time step generated by one particular threshold, summing it up and save it in their network, with the critic evaluating the decision of choosing the said action (threshold). It will do so until the end of the training episode, in which it will select the most rewarding action (threshold).

HazwanDrK 2020 年 7 月 22 日

Sorry I can't quite catch that. My double integrator works about the same way however instead from zero initial position with a step response at certain time step for state x. Do I still need to include position error for my observation though? Also another thing, for custom environment using function names, do I need to specify my own reward function in the step function? It was not being specifically brought up in the DDPG example since I reckoned it's predefined. Thanks!

Emmanouil Tzorakoleftherakis 2020 年 7 月 22 日

I would say you need the error yes if you are planning on tracking a collection of step responses. Otherwise, you would be basically overfitting to a single reference value if that makes sense.

You don't need a separate function for the reward, it can be incorporated in the step function as shown in the link you mentioned.

サインインしてコメントする。

Answer 2

Mehrdad Moradi 2021 年 7 月 26 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/565388-using-reinforcement-learning-algorithm-to-optimize-parameter-s-of-a-controller#answer_753779

編集済み: Mehrdad Moradi 2021 年 7 月 26 日

But how configure RL to find a static gains while action signal a a time series of different values? Is there any guideline about it?

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

Using Reinforcement Learning algorithm to optimize parameter(s) of a controller

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

8 件のコメント
6 件の古いコメントを表示6 件の古いコメントを非表示

その他の回答 (1 件)

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

Using Reinforcement Learning algorithm to optimize parameter(s) of a controller

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

採用された回答

8 件のコメント 6 件の古いコメントを表示6 件の古いコメントを非表示

その他の回答 (1 件)

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

8 件のコメント
6 件の古いコメントを表示6 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示