generateHindsightExperiences
Generate hindsight experiences from hindsight experience replay buffer
Since R2023a
Description
experience = generateHindsightExperiences(buffer,trajectoryLength) generates hindsight experiences from the last trajectory added to the specified hindsight experience replay memory buffer.
Examples
When you use a hindsight replay memory buffer within your custom agent training loop, you generate experiences at the end of each training episode.
Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the six signals in the observation channel are arranged as follows:
Elements 4 and 5 are the goal observations.
Elements 2 and 3 are the goal measurements.
Elements 1 and 6 are additional observations.
obsInfo = rlNumericSpec([6 1],...
LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);
Create a specification for a single action.
actInfo = rlNumericSpec([1 1],...
LowerLimit=0,UpperLimit=10);
To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.
goalConditionInfo = {{1,[2 3],1,[4 5]}};
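The nested cell format can be hard to parse, so here is the same definition with each entry annotated in comments, following the interpretation described above:
goalConditionInfo = {{ ...
    1, [2 3], ... % goal measurements: channel 1, elements 2 and 3
    1, [4 5]}};   % goals: channel 1, elements 4 and 5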
For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.
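This page does not define these supporting functions. The following is a minimal sketch of what they might look like, assuming an (obs,action,nextObs) signature with cell-array observation inputs and the goal layout above; the distance threshold is an arbitrary illustrative choice.
function reward = hindsightRewardFcn1(obs,action,nextObs)
% Sketch of a goal-conditioned reward function (signature assumed).
% Goal measurements are elements 2 and 3; goals are elements 4 and 5.
goal = nextObs{1}(4:5);
measurement = nextObs{1}(2:3);
reward = double(norm(measurement - goal) < 0.1); % sparse reward near the goal
end

function isdone = hindsightIsDoneFcn1(obs,action,nextObs)
% Sketch of a termination condition function (signature assumed).
goal = nextObs{1}(4:5);
measurement = nextObs{1}(2:3);
isdone = norm(measurement - goal) < 0.1; % terminate once the goal is reached
end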
Create the hindsight replay memory buffer.
buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
@hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);
As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.
for i = 1:10
    exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Action = {actInfo.UpperLimit.*rand(1)};
    exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
    exp(i).Reward = 10*rand(1);
    exp(i).IsDone = 0;
end
exp(10).IsDone = 1;
append(buffer,exp);
At the end of the training episode, you generate hindsight experiences from the last trajectory added to the buffer. Generate experiences specifying the length of the last trajectory added to the buffer.
newExp = generateHindsightExperiences(buffer,10);
For each experience in the final trajectory, the default "final" sampling strategy generates a new experience where it replaces the goals in Observation and NextObservation with the goal measurements from the final experience in the trajectory.
To validate this behavior, first view the final goal measurements from exp.
exp(10).NextObservation{1}(2:3)
ans = 2×1
0.7277
0.6803
Next, view the goal values for one of the generated experiences. This value should match the final goal measurement.
newExp(6).Observation{1}(4:5)
ans = 2×1
0.7277
0.6803
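Because the "final" strategy generates one new experience for each experience in the last trajectory, newExp should contain as many elements as the original trajectory.
numel(newExp) % one generated experience per original experience; expected to be 10 here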
After generating the new experiences, append them to the buffer.
append(buffer,newExp);
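Putting the pieces together, one way the buffer might fit into a custom training loop is sketched below. Here collectTrajectory, env, policy, maxEpisodes, and batchSize are hypothetical placeholders for your own training code; sample is the replay-memory sampling function.
for episode = 1:maxEpisodes
    traj = collectTrajectory(env,policy); % hypothetical: gather one episode
    append(buffer,traj); % store the real experiences
    newExp = generateHindsightExperiences(buffer,numel(traj));
    append(buffer,newExp); % store the hindsight experiences
    miniBatch = sample(buffer,batchSize); % sample a minibatch for learning
    % ... update the agent using miniBatch ...
end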
Input Arguments
buffer — Hindsight experience buffer, specified as an rlHindsightReplayMemory or rlHindsightPrioritizedReplayMemory object.
trajectoryLength — Length of the last trajectory added to the buffer, specified as a positive integer.
Output Arguments
experience — Experiences generated from the last trajectory in the buffer, returned as a structure array with the following fields.
Observation, returned as a cell array with length equal to the number of observation specifications specified when creating the buffer. Each element of Observation contains a DO-by-batchSize-by-SequenceLength array, where DO is the dimension of the corresponding observation specification.
Agent action, returned as a cell array with length equal to the number of action specifications specified when creating the buffer. Each element of Action contains a DA-by-batchSize-by-SequenceLength array, where DA is the dimension of the corresponding action specification.
Reward value obtained by taking the specified action from the observation, returned as a 1-by-1-by-SequenceLength array.
Next observation reached by taking the specified action from the observation, returned as a cell array with the same format as Observation.
Termination signal, returned as a 1-by-1-by-SequenceLength array of integers. Each element of IsDone has one of the following values.
0 — This experience is not the end of an episode.
1 — The episode terminated because the environment generated a termination signal.
2 — The episode terminated by reaching the maximum episode length.
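As a brief sketch, assuming the newExp array generated in the example above, you can locate the generated experiences that end an episode.
% Indices of generated experiences whose IsDone value is nonzero
isTerminal = arrayfun(@(e) e.IsDone(1) ~= 0, newExp);
find(isTerminal)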
Version History
Introduced in R2023a
See Also
Functions
append | sample
Objects
rlHindsightReplayMemory | rlHindsightPrioritizedReplayMemory