There's something odd going on. It's not 0 reward, but it's not growing. I do have that first action method I said I implemented in the other question (so for 4 of the continuous actions, it only chooses the first action), and for 1 action it's used every time step. I guess I need to check the logged signals to really determine what's going on. I'm too excited to make it work on the first or second try lol
Enforce action space constraints within the environment
27 views (last 30 days)
John Doe
24 Feb 2021
Hi,
My agent is training!!! But it's pretty much 0 reward every episode right now. I think it might be due to this:
contActor does not enforce constraints set by the action specification, therefore, when using this actor, you must enforce action space constraints within the environment.
How can I do this?
Also, is there a way to view the logged signals as the agent is training?
Thanks!
1 Comment
Accepted Answer
Emmanouil Tzorakoleftherakis
24 Feb 2021
If the environment is in Simulink, you can set up scopes and observe what's happening during training. If the environment is in MATLAB, you need to do some extra work and plot things yourself.
For your constraints question, which agent are you using? Some agents are stochastic, and some, like DDPG, add noise for exploration on top of the action output. To be certain, you can use a Saturation block in Simulink or an if statement to clip the action as needed in MATLAB.
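In MATLAB, that clipping can be a one-liner inside the environment's step function. A minimal sketch, assuming a custom environment class; the property and variable names here are placeholders, not toolbox requirements:

```matlab
% Sketch: saturate the incoming action inside the environment step function
% before applying it to the dynamics.
function [nextObs,reward,isDone,loggedSignals] = step(this,action)
    lb = this.ActionInfo.LowerLimit;    % e.g. [2; 0.01; 0.01; 0.01; 0.01]
    ub = this.ActionInfo.UpperLimit;    % e.g. [50; 4; 4; 4; 4]
    action = min(max(action,lb),ub);    % element-wise clip to the spec limits
    % ... apply the clipped action to the dynamics as usual ...
end
```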
28 Comments
John Doe
24 Feb 2021
I'm using MATLAB right now. It's a stochastic agent: actor = rlStochasticActorRepresentation(actorNetwork,observationInfo,actionInfo,...
'Observation',{'observation'},actorOpts);
The upper and lower limits are different for the different actions: 2 to 50 for one action, and 0.01 to 4 for the other four.
Emmanouil Tzorakoleftherakis
24 Feb 2021
Right, but which agent are you using? PPO? SAC? PGAgent?
John Doe
24 Feb 2021
PPO. Pretty much what the RocketLander example does except doubling the actor size
Emmanouil Tzorakoleftherakis
24 Feb 2021
Thanks. For PPO with continuous actions, we don't enforce the constraints, so you need to do that yourself (take a look at the bottom here).
John Doe
24 Feb 2021
Looks like my issue was that it wasn't getting past 2 steps because my isDone was the opposite of what I wanted. Started with a 1127 reward just now, hit the spot :D
John Doe
26 Feb 2021
actionInfo = getActionInfo(env);
observationInfo = getObservationInfo(env);
numObs = observationInfo.Dimension(1);
numAct = actionInfo.Dimension(1);
disp([numObs numAct])
criticLayerSizes = [400 300];
actorLayerSizes = [400 300];
criticNetwork = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(criticLayerSizes(1),'Name','CriticFC1', ...
'Weights',sqrt(2/numObs)*(rand(criticLayerSizes(1),numObs)-0.5), ...
'Bias',1e-3*ones(criticLayerSizes(1),1))
reluLayer('Name','CriticRelu1')
fullyConnectedLayer(criticLayerSizes(2),'Name','CriticFC2', ...
'Weights',sqrt(2/criticLayerSizes(1))*(rand(criticLayerSizes(2),criticLayerSizes(1))-0.5), ...
'Bias',1e-3*ones(criticLayerSizes(2),1))
reluLayer('Name','CriticRelu2')
fullyConnectedLayer(1,'Name','CriticOutput', ...
'Weights',sqrt(2/criticLayerSizes(2))*(rand(1,criticLayerSizes(2))-0.5), ...
'Bias',1e-3)];
Create the critic representation.
criticOpts = rlRepresentationOptions('LearnRate',1e-4);
critic = rlValueRepresentation(criticNetwork,observationInfo,'Observation',{'observation'},criticOpts);
Create the actor using a deep neural network with six inputs and two outputs. The outputs of the actor network are the probabilities of taking each possible action pair. Each action pair contains normalized action values for each thruster. The environment step function scales these values to determine the actual thrust values.
actorNetwork = [featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(actorLayerSizes(1),'Name','ActorFC1', ...
'Weights',sqrt(2/numObs)*(rand(actorLayerSizes(1),numObs)-0.5), ...
'Bias',1e-3*ones(actorLayerSizes(1),1))
reluLayer('Name','ActorRelu1')
fullyConnectedLayer(actorLayerSizes(2),'Name','ActorFC2', ...
'Weights',sqrt(2/actorLayerSizes(1))*(rand(actorLayerSizes(2),actorLayerSizes(1))-0.5), ...
'Bias',1e-3*ones(actorLayerSizes(2),1))
reluLayer('Name', 'ActorRelu2')
fullyConnectedLayer(numAct*2,'Name','Action', ...
'Weights',sqrt(2/actorLayerSizes(2))*(rand(numAct*2,actorLayerSizes(2))-0.5), ...
'Bias',1e-3*ones(numAct*2,1))
softmaxLayer('Name','actionProb')];
Create the actor using a stochastic actor representation.
actorOpts = rlRepresentationOptions('LearnRate',1e-4);
actor = rlStochasticActorRepresentation(actorNetwork,observationInfo,actionInfo,...
'Observation',{'observation'},actorOpts);
Specify the agent hyperparameters using an rlPPOAgentOptions object.
agentOpts = rlPPOAgentOptions(...
'ExperienceHorizon',600,...
'ClipFactor',0.02,...
'EntropyLossWeight',0.01,...
'MiniBatchSize',128,...
'NumEpoch',3,...
'AdvantageEstimateMethod','gae',...
'GAEFactor',0.95,...
'SampleTime',Ts,...
'DiscountFactor',0.997);
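For completeness, the pieces above would then be assembled into the agent and trained the same way the RocketLander example does it; the training-option values below are illustrative assumptions, not the example's exact settings:

```matlab
% Combine actor, critic, and options into the PPO agent, then train.
agent = rlPPOAgent(actor,critic,agentOpts);
trainOpts = rlTrainingOptions(...
    'MaxEpisodes',20000,...
    'MaxStepsPerEpisode',600,...
    'StopTrainingCriteria','AverageReward',...
    'StopTrainingValue',10000);         % placeholder stopping criterion
trainingStats = train(agent,env,trainOpts);
```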
John Doe
1 Mar 2021
Edited: John Doe, 1 Mar 2021
I've used the continuous 1- and 2-input examples to set up the agent now and I think I'm really close, but I can't figure this out for 5 real actions with their own limits. I've simplified things and it still throws mismatch errors. Is there another example with more than 2 actions that I can follow? It's perplexing to figure out which dimensions to change using the 1- and 2-action examples; whether something's supposed to be 5, 10, 1, or 2 :)
Invalid network.
this.Assembler, this.AnalyzedLayers,this.NetworkInfo] = createInternalNeuralNetwork(this);
this = buildNetwork(this);
Model = rl.representation.model.rlLayerModel(Model, UseDevice, ObservationNames, ActionNames);
Model = rl.util.createInternalModelFactory(Model, Options, ObservationNames, ActionNames, InputSize, OutputSize);
Caused by:
Layer 'mean&sdev': Input size mismatch. Size of input to this layer is different from the expected input size.
Inputs to this layer:
from layer 'mp_out' (output size 5)
from layer 'splus' (output size 5)
My code is:
numObs = obsInfo.Dimension(1);
numAct = actInfo.Dimension(1);
disp([numObs, numAct])
8 5
% input path layers (8 by 1 input and a 5 by 1 output)
inPath = [
featureInputLayer(numObs,'Normalization','none','Name','observation')
fullyConnectedLayer(10,'Name', 'ip_fc') % 10 by 1 output
reluLayer('Name', 'ip_relu') % nonlinearity
fullyConnectedLayer(5,'Name','ip_out') ];% 5 by 1 output
meanPath = [
% fullyConnectedLayer(5,'Name', 'mp_fc1') % 5 by 1 output
% reluLayer('Name', 'mp_relu') % nonlinearity
% fullyConnectedLayer(5,'Name','mp_fc2'); % 5 by 1 output
tanhLayer('Name','mp_tanh'); % output range: (-1,1)
scalingLayer('Name','mp_out','Scale',actInfo.UpperLimit(1)) ];
sdevPath = [
% fullyConnectedLayer(5,'Name', 'sp_fc1') % 5 by 1 output
% reluLayer('Name', 'sp_relu') % nonlinearity
% fullyConnectedLayer(5,'Name','sp_fc2'); % 5 by 1 output
softplusLayer('Name', 'splus') ]; % output range: (0,+Inf)
%
% % concatenate two inputs (along dimension #3) to form a single (10 by 1) output layer
outLayer = concatenationLayer(3,2,'Name','mean&sdev');
disp(outLayer);
% add layers to layerGraph network object
actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);
actorNet = addLayers(actorNet,outLayer);
% connect layers: the mean value path output MUST be connected to the FIRST input of the concatenation layer
actorNet = connectLayers(actorNet,'ip_out','mp_tanh/in'); % connect output of inPath to meanPath input
actorNet = connectLayers(actorNet,'ip_out','splus/in'); % connect output of inPath to sdevPath input
actorNet = connectLayers(actorNet,'mp_out','mean&sdev/in1');% connect output of meanPath to mean&sdev input #1
actorNet = connectLayers(actorNet,'splus','mean&sdev/in2');% connect output of sdevPath to mean&sdev input #2
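One likely cause of the size mismatch is the concatenation dimension: with featureInputLayer, the features live along dimension 1, so the two 5-element paths should be concatenated along dimension 1 rather than dimension 3 (which comes from the older imageInputLayer-based examples). A hedged sketch of the same network with that change; this is a guess at the fix, not a confirmed answer from the thread:

```matlab
% Same architecture, but concatenating the mean and std paths along dim 1.
inPath = [
    featureInputLayer(numObs,'Normalization','none','Name','observation')
    fullyConnectedLayer(10,'Name','ip_fc')
    reluLayer('Name','ip_relu')
    fullyConnectedLayer(numAct,'Name','ip_out')];            % 5-by-1 output
meanPath = [
    tanhLayer('Name','mp_tanh')                              % range (-1,1)
    scalingLayer('Name','mp_out','Scale',actInfo.UpperLimit)];
sdevPath = softplusLayer('Name','splus');                    % range (0,Inf)
outLayer = concatenationLayer(1,2,'Name','mean&sdev');       % dim 1, not 3
actorNet = layerGraph(inPath);
actorNet = addLayers(actorNet,meanPath);
actorNet = addLayers(actorNet,sdevPath);
actorNet = addLayers(actorNet,outLayer);
actorNet = connectLayers(actorNet,'ip_out','mp_tanh/in');
actorNet = connectLayers(actorNet,'ip_out','splus/in');
actorNet = connectLayers(actorNet,'mp_out','mean&sdev/in1'); % mean path FIRST
actorNet = connectLayers(actorNet,'splus','mean&sdev/in2');
```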
Emmanouil Tzorakoleftherakis
1 Mar 2021
If you are using R2020b, I suggest taking a look at the default agents feature. The feature recommends an initial network architecture for you given only obsInfo and actionInfo, so that you don't have to create the architecture yourself. See this page for examples.
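A minimal sketch of that default-agent workflow (available from R2020b); the hidden-unit count below is an arbitrary choice, not a recommendation from the thread:

```matlab
% Let the toolbox build default actor and critic networks from the specs.
obsInfo  = getObservationInfo(env);
actInfo  = getActionInfo(env);
initOpts = rlAgentInitializationOptions('NumHiddenUnit',256);
agent    = rlPPOAgent(obsInfo,actInfo,initOpts);
```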
John Doe
2 Mar 2021
Wow that was easy!! My actor and critic networks succeeded and it's training. However, I'm seeing the actions go much higher and lower than the upper and lower limits set in the actionInfo. My environment is forcing them to the correct values; however, this is very sample inefficient and the training is going pretty slow. Is there a way to enforce the upper and lower limits for each action? Even the same limit for all actions would work better, if that can be done (e.g. lower 0 and upper 50).
John Doe
2 Mar 2021
Actually please stand by, the getAction method is returning the right values. I'll need to check why the environment actions are not the same.
John Doe
2 Mar 2021
It looks like the issue is normalization of the state vector.
The rand example looks to be outputting normalized values between 0 and 1 always.
test = getAction(agent,{rand(obsInfo(1).Dimension)})
This outputs agent actions within limits each time.
However, when I replace it with my actual state values, whose values are all over the place from thousands to decimals, positive and negative, the action values also go haywire to thousands.
I don't set upper or lower limits for the state observations. I guess I'll need to set them and try it out? Or force normalization of the observations somehow? Any hints?
John Doe
2 Mar 2021
Never mind. I set the limits on the observations and it's still breaking action limits. Halp!
Emmanouil Tzorakoleftherakis
2 Mar 2021
Please take a look at this page at the bottom: "For continuous action spaces, this agent does not enforce the constraints set by the action specification. In this case, you must enforce action space constraints within the environment."
A couple of things you can do:
1) Make sure the mean path in the actor is scaled to the desired range (if you use the default agent feature that should be handled automatically, but you can verify by extracting the neural network with the "getActor" and "getModel" methods)
2) Normalizing observations and actions is always a good idea if that's an option, since it leads to more stable training
John Doe
2 Mar 2021
actorNet = getModel(getActor(agent));
criticNet = getModel(getCritic(agent));
criticNet.Layers
ans =
7×1 Layer array with layers:
1 'input_1' Feature Input 8 features
2 'fc_1' Fully Connected 256 fully connected layer
3 'relu_body' ReLU ReLU
4 'fc_body' Fully Connected 256 fully connected layer
5 'body_output' ReLU ReLU
6 'output' Fully Connected 1 fully connected layer
7 'RepresentationLoss' Regression Output mean-squared-error
actorNet.Layers
ans =
12×1 Layer array with layers:
1 'input_1' Feature Input 8 features
2 'fc_1' Fully Connected 256 fully connected layer
3 'relu_body' ReLU ReLU
4 'fc_body' Fully Connected 256 fully connected layer
5 'body_output' ReLU ReLU
6 'fc_mean' Fully Connected 5 fully connected layer
7 'tanh' Tanh Hyperbolic tangent
8 'scale' ScalingLayer Scaling layer
9 'fc_std' Fully Connected 5 fully connected layer
10 'std' SoftplusLayer Softplus layer
11 'output' Concatenation Concatenation of 2 inputs along dimension 1
12 'RepresentationLoss' Regression Output mean-squared-error
test = getAction(agent,{rand(obsInfo(1).Dimension)})
celldisp(test);
test{1} =
26.4674
3.2738
2.3261
1.9611
2.5504
test = getAction(agent,{[ 3001 -6.8012 3.1039 354.7631 0 0 8.2629e+04 0 ]})
celldisp(test)
test{1} =
1.0e+03 *
0.0020
0.0223
-5.1131
-1.0185
-6.1576
% And during training some action examples align with the example one above:
1.0e+03 *
0.0020
0.0012
-8.4782
4.2069
-2.3160
1.0e+03 *
0.0020
-0.0054
6.1911
5.2785
0.6783
John Doe
2 Mar 2021
Edited: John Doe, 2 Mar 2021
I believe the scaling that you mention should be occurring already, because it's using the getModel function, but I have no clue. However, the action outputs are out of scale. Normalizing all of our input data would be a humongous task, so if possible, I'd like to make this work without doing that.
Emmanouil Tzorakoleftherakis
2 Mar 2021
You could simply add 8 scaling gains to transform the observation inputs so that they are in the same order of magnitude. That should be helpful enough.
Also, if you set up the action specifications correctly, those values should transfer to the 'scalingLayer' in the actor. Can you paste the code where you create the action space? Also, a mat file with the actor and critic would be helpful to take a closer look
John Doe
2 Mar 2021
Action spec:
action_spec = { { ["action1; " ], [ 2 ], [ 50 ] } ...
{ ["action2;"; "action3;"; "action4;"; "action5;"; ], [ 0.01; 0.01; 0.01; 0.01 ], [4; 4; 4; 4 ] } ...
} ;
[ action_info_combined_name, action_info_combined_lower_limits, action_info_combined_upper_limits ] = retrieveCombinedSpecColumn(action_spec);
ActionInfo.Name = action_info_combined_name;
ActionInfo = rlNumericSpec([5 1],'LowerLimit', action_info_combined_lower_limits, 'UpperLimit', action_info_combined_upper_limits);
function [ name, upper, lower ] = retrieveCombinedSpecColumn(spec)
% function to combine action spec columns into one object for specification to RL
for index = 1:3
% Start with first row
concated_array_column = spec{1}{index};
% loop through all the remaining action spec rows
for i=2:length(spec)
% gobble up action info specs into one obj array
if i>2
joined_array = obj;
else
joined_array = concated_array_column;
end
obj = cat(1, joined_array, spec{i}{index});
end
if index == 1
name = obj;
elseif index == 2
upper = obj;
elseif index == 3
lower = obj;
end
end
end
>> ActionInfo
ActionInfo =
rlNumericSpec with properties:
LowerLimit: [5×1 double]
UpperLimit: [5×1 double]
Name: [0×0 string]
Description: [0×0 string]
Dimension: [5 1]
DataType: "double"
>> ActionInfo.UpperLimit
ans =
50
4
4
4
4
>> ActionInfo.LowerLimit
ans =
2.0000
0.0100
0.0100
0.0100
0.0100
>> ActionInfo.Name
ans =
0×0 empty string array
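For a spec this small, the combined limits can also be written out directly. Note that in the code above, ActionInfo.Name is assigned before rlNumericSpec is called, so the rlNumericSpec call overwrites it; that is likely why Name shows up as an empty string array. A sketch (the name string is a placeholder):

```matlab
% Create the 5-action spec directly, then set Name AFTER the spec exists.
ActionInfo = rlNumericSpec([5 1],...
    'LowerLimit',[2; 0.01; 0.01; 0.01; 0.01],...
    'UpperLimit',[50; 4; 4; 4; 4]);
ActionInfo.Name = "combined actions";   % placeholder name
```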
John Doe
2 Mar 2021
I'm not sure how to export the networks to a mat file. Do I use this Deep Network Designer thing to do that? Or do you want the actor and critic layer data? I have that for the actions, but not for the critic. I'm not sure how I'd get data for the critic.
Emmanouil Tzorakoleftherakis
2 Mar 2021
Edited: Emmanouil Tzorakoleftherakis, 2 Mar 2021
It looks like the scaling layer has the right scale and bias values, but just to be sure, can you send me the mat file too?
save test.mat actor critic agent
Emmanouil Tzorakoleftherakis
2 Mar 2021
if you type
predict(actorNet,[ 3001 -6.8012 3.1039 354.7631 0 0 8.2629e+04 0 ])
(just another way to do inference), you will see that the first 5 values (which represent the mean) are within the range you want, but the last 5 (the standard deviations) are large, since there is no standard way to constrain these. That's why, when these two are combined to sample an action, the final action output is large.
As mentioned in the doc, you should enforce the constraints on the environment side, and as a best practice, scale the inputs to the network as well so that they are in the same order of magnitude.
John Doe
2 Mar 2021
Edited: John Doe, 2 Mar 2021
How can I do the scaling of the inputs to the network? That seems like the best way forward.
The environment is already constraining the actions, but the training is extremely sample inefficient and basically bouncing across the upper and lower limits of the actions for hundreds of episodes.
Emmanouil Tzorakoleftherakis
3 Mar 2021
Multiply the observations inside the 'step' function by a number that makes sense.
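A sketch of that inside the step function; the gain values below are pure placeholders picked from the rough magnitudes quoted earlier in the thread, and would need tuning per signal:

```matlab
% Bring each observation channel to roughly the same order of magnitude
% before returning it from the environment step function.
obsScale = [1e-3; 1; 1; 1e-2; 1; 1; 1e-4; 1];  % placeholder per-channel gains
nextObs  = rawObs(:) .* obsScale;              % element-wise scaling
```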
More Answers (1)
John Doe
17 Mar 2021
Edited: John Doe, 17 Mar 2021
Hi,
I feel like I'm really close to getting this, but I haven't gotten a successful run yet. For thousands of episodes, the agent continues to use actions way outside the limits. I've tried adding the min/max clipping to force them within bounds in the environment. Do you have any tips on how I can make it converge to stay within the limits? I even tried changing the rewards to encourage staying close to the limits.
I'm wondering whether this is perhaps a known issue, and whether having the continuous agent pick actions within the spec limits is on the roadmap?
5 Comments
John Doe
17 Mar 2021
Please ignore. The actions are within the limits after normalizing the state, I guess. They do sometimes go out of bounds to a negative number.
John Doe
17 Mar 2021
Actually, I'm not sure. It is in limits for some episodes and not for others.
John Doe
18 Mar 2021
Edited: John Doe, 18 Mar 2021
Here are some examples that show it goes out of bounds on the very first action itself.
state- [0.3000
0.4906
0.0621
0.0187
0
0
0.4031
0]
actions chosen on different episodes (limits set to 0.1 and 10):
-226.2024
-225.1637
-109.6427
-52.5793
10.0005
10.0001
525.8961
457.7566
The state is normalized and leads to perfect values when I run the command:
test = getAction(agent,[ 0.3000 0.4906 0.0621 0.0187 0 0 0.4031 0]')
Why does the train function's actor choose values that are way out of whack for thousands of episodes? And why does the getAction command show such intelligent commands?
During the training, the agent's actor goes to thousands and tens of thousands in some episodes. It doesn't matter if I constrain the action in the environment or not. It likes to go all over the place and will not converge in thousands and thousands of episodes. Halp!
John Doe
18 Mar 2021
Here's an example training. I gave it a negative reward for going outside the bounds of the action. This demonstrates how far outside the range the actor is picking. This same thing occurs for more episodes (5000), although I don't have a screenshot for that. Surely there must be something I'm doing wrong? How can I make this converge?
John Doe
25 Mar 2021
I had a bug where I was using normalized values instead of the real values! I was able to solve the environment after that, after changing the action space to discrete! Thanks for all your help and this wonderful toolbox!