rlHindsightReplayMemory

Hindsight replay memory experience buffer

Since R2023a

    Description

    An off-policy reinforcement learning agent stores experiences in a circular experience buffer.

    During training the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:

    • S is the current observation of the environment.

    • A is the action taken by the agent.

    • R is the reward for taking action A.

    • S' is the next observation after taking action A.

    • D is the is-done signal after taking action A.

    The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.
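
    In MATLAB, one such experience can be represented as a structure, which is the form used by the append example later on this page. The following sketch builds a single experience with placeholder values. The field names are taken from that example; the six-element observation and the scalar action are assumptions made only for illustration.

```matlab
% One experience (S,A,R,S',D) as a structure with placeholder values.
experience.Observation = {zeros(6,1)};      % S, one cell per observation channel
experience.Action = {0.5};                  % A
experience.Reward = -0.01;                  % R
experience.NextObservation = {ones(6,1)};   % S'
experience.IsDone = 0;                      % D
```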

    By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. For goal-conditioned tasks, where the observation includes both the goal and a goal measurement, you can use an rlHindsightReplayMemory object.

    A hindsight replay memory experience buffer:

    • Generates additional experiences by replacing goals with goal measurements

    • Improves sample efficiency for tasks with sparse rewards

    • Requires a ground-truth reward function and is-done function

    • Is not necessary when you have a well-shaped reward function

    rlHindsightReplayMemory objects uniformly sample experiences from the buffer. To use prioritized nonuniform sampling, which can improve sample efficiency, use an rlHindsightPrioritizedReplayMemory object.

    For more information on hindsight experience replay, see Algorithms.

    Creation

    Description

    buffer = rlHindsightReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo) creates a hindsight replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively. This syntax sets the RewardFcn, IsDoneFcn, and GoalConditionInfo properties.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo,maxLength) sets the maximum length of the buffer by setting the MaxLength property.

    Input Arguments

    Observation specifications (obsInfo), specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data types, and names of the observation signals.

    You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    Action specifications (actInfo), specified as a reinforcement learning specification object defining properties such as dimensions, data types, and names of the action signals.

    You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.

    Properties

    This property is read-only.

    Maximum buffer length (MaxLength), specified as a nonnegative integer.

    To change the maximum buffer length, use the resize function.

    This property is read-only.

    Number of experiences in buffer (Length), specified as a nonnegative integer.

    Goal condition information, specified as a 1-by-N cell array, where N is the number of goal conditions. For the ith goal condition, the corresponding cell of GoalConditionInfo contains a 1-by-4 cell array with the following elements.

    • GoalConditionInfo{i}{1} — Goal measurement channel index.

    • GoalConditionInfo{i}{2} — Goal measurement element indices.

    • GoalConditionInfo{i}{3} — Goal channel index.

    • GoalConditionInfo{i}{4} — Goal element indices.

    The goal measurements in GoalConditionInfo{i}{2} correspond to the goals in GoalConditionInfo{i}{4}.

    As an example, suppose that obsInfo contains specifications for two observation channels. Further, suppose that there is one goal condition where the goal measurements correspond to elements 2 and 3 of the first observation channel, and the goals correspond to elements 4 and 5 of the second observation channel. In this case, the goal condition information is:

    GoalConditionInfo = {{1,[2 3],2,[4 5]}};
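
    To make the indexing concrete, the following sketch applies one goal condition to a sample observation stored as a cell array with one cell per channel. The observation values and the names obs, gc, goalMeasurement, goal, and relabeledObs are made up for illustration. The sketch only shows how the four elements are interpreted for the case described above (measurements in elements 2 and 3 of channel 1, goals in elements 4 and 5 of channel 2); it is not the buffer's internal implementation.

```matlab
% Hypothetical observation with two channels, stored one channel per cell.
obs = {[0.1; 0.2; 0.3], [10; 20; 30; 40; 50]};

% One goal condition: measurements are elements 2 and 3 of channel 1,
% goals are elements 4 and 5 of channel 2.
gc = {1,[2 3],2,[4 5]};

% Extract the goal measurements and the goals that they correspond to.
goalMeasurement = obs{gc{1}}(gc{2});
goal = obs{gc{3}}(gc{4});

% Relabel the observation by overwriting the goals with the measurements.
relabeledObs = obs;
relabeledObs{gc{3}}(gc{4}) = goalMeasurement;
```

    After this substitution, the goal elements of the second channel hold the values that the agent actually measured, which is the replacement that hindsight experience generation relies on.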

    Reward function (RewardFcn), specified as a handle to a function with the following signature.

    function reward = myRewardFcn(obs,action,nextObs)

    Here:

    • reward is a scalar reward value.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.

    Is-done function (IsDoneFcn), specified as a handle to a function with the following signature.

    function isdone = myIsDoneFcn(obs,action,nextObs)

    Here:

    • isdone is true when the next observation is a terminal condition and false otherwise.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.

    Goal measurement sampling strategy (Strategy), specified as one of the following values.

    • "final" — Use the goal measurement from the end of the trajectory.

    • "episode" — Randomly sample M goal measurements from the trajectory, where M is equal to NumGoalSamples.

    • "future" — For an experience at time t, randomly sample M goal measurements from those observed in the trajectory at time t+1 or later.

    Number of goal measurements to sample when generating experiences (NumGoalSamples), specified as a positive integer. This property is ignored when Strategy is "final".
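
    The three strategies differ in which time steps can supply the replacement goal. The following sketch picks candidate time indices for an experience at step t in a trajectory of length T, assuming t < T. It illustrates the descriptions above using base MATLAB only; the variable names are hypothetical, and the sampling inside the buffer (for example, whether it samples with replacement) may differ.

```matlab
T = 10;   % trajectory length (hypothetical)
t = 4;    % index of the experience being relabeled (hypothetical, t < T)
M = 3;    % number of goal samples, playing the role of NumGoalSamples

% "final": always use the goal measurement from the last step.
idxFinal = T;

% "episode": M random steps from anywhere in the trajectory.
idxEpisode = randi(T,1,M);

% "future": M random steps chosen from t+1 or later.
idxFuture = t + randi(T-t,1,M);
```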

    Object Functions

    append                          Append experiences to replay memory buffer
    sample                          Sample experiences from replay memory buffer
    resize                          Resize replay memory experience buffer
    reset                           Reset environment, agent, experience buffer, or policy object
    allExperiences                  Return all experiences in replay memory buffer
    validateExperience              Validate experiences for replay memory
    generateHindsightExperiences    Generate hindsight experiences from hindsight experience replay buffer
    getActionInfo                   Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
    getObservationInfo              Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer

    Examples

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    Create a DDPG agent from the environment specifications. By default, the agent uses a replay memory experience buffer with uniform sampling.

    agent = rlDDPGAgent(obsInfo,actInfo);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    Define an is-done function. For this example, the is-done signal is true when the next observation satisfies the goal condition $\sqrt{(x_m-x_g)^2+(y_m-y_g)^2}<0.1$.

    function isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation)
        NextObservation = NextObservation{1};
        xm = NextObservation(2);
        ym = NextObservation(3);
        xg = NextObservation(4);
        yg = NextObservation(5);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and -0.01 otherwise.

    function reward = hindsightRewardFcn1(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = -0.01;
        end
    end
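
    It can help to check the goal logic on a hand-built observation before passing the functions to the buffer. The following sketch restates the is-done and reward rules as anonymous functions so that it runs on its own. The names isDoneCheck, rewardCheck, nearGoal, and farGoal are hypothetical and are used only here. The sketch evaluates one observation whose goal measurement is within 0.1 of the goal and one whose goal measurement is not.

```matlab
% Stand-ins that mirror hindsightIsDoneFcn1 and hindsightRewardFcn1.
% Observations are passed as cell arrays with one cell per channel.
isDoneCheck = @(nextObs) ...
    sqrt((nextObs{1}(2)-nextObs{1}(4))^2 + (nextObs{1}(3)-nextObs{1}(5))^2) < 0.1;
rewardCheck = @(nextObs) 1*isDoneCheck(nextObs) - 0.01*(~isDoneCheck(nextObs));

% Signal layout: [a; xm; ym; xg; yg; c]
nearGoal = {[0; 1.00; 2.00; 1.05; 2.00; 0]};  % distance 0.05
farGoal  = {[0; 1.00; 2.00; 3.00; 4.00; 0]};  % distance about 2.83

doneNear = isDoneCheck(nearGoal);
doneFar = isDoneCheck(farGoal);
rNear = rewardCheck(nearGoal);
rFar = rewardCheck(farGoal);
```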
    

    Create a hindsight replay memory buffer with a default maximum length.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    Replace the default experience buffer with the hindsight replay memory buffer.

    agent.ExperienceBuffer = buffer;

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create observation specifications for an environment with two observation channels. For this example, assume that the first observation channel contains the signals [a, xm, ym] and the second observation channel contains the signals [xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = [rlNumericSpec([3 1]), rlNumericSpec([3 1])];

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    To create a hindsight replay memory buffer, first define the goal condition information. The goal measurements are in elements 2 and 3 of the first observation channel and the goals are in elements 1 and 2 of the second observation channel.

    goalConditionInfo = {{1,[2 3],2,[1 2]}};

    Define an is-done function. For this example, the is-done signal is true when the next observation satisfies the goal condition $\sqrt{(x_m-x_g)^2+(y_m-y_g)^2}<0.1$.

    function isdone = hindsightIsDoneFcn2(Observation,Action,NextObservation)
        NextObsCh1 = NextObservation{1};
        NextObsCh2 = NextObservation{2};
        xm = NextObsCh1(2);
        ym = NextObsCh1(3);
        xg = NextObsCh2(1);
        yg = NextObsCh2(2);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and 0 otherwise.

    function reward = hindsightRewardFcn2(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn2(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = 0;
        end
    end
    

    Create a hindsight replay memory buffer with a maximum length of 20000.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn2,@hindsightIsDoneFcn2,...
        goalConditionInfo,20000);

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create an observation specification for an environment with a single observation channel with eight observations. For this example, assume that the observation channel contains the signals [a, xm, ym, θ, xg, yg, θm, c], where:

    • xg, yg, and θ are the goal observations.

    • xm, ym, and θm are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([8 1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    To create a hindsight replay memory buffer, first define the goal condition information. For this example, define two goal conditions.

    The first goal condition depends on xm, ym, xg, and yg as shown in the following equation.

    $$\sqrt{(x_m-x_g)^2+(y_m-y_g)^2}<0.1$$

    Specify the information for this goal condition.

    goalConditionInfo1 = {1,[2 3], 1, [5 6]};

    The second goal condition depends on θm and θ as shown in the following equation.

    $$(\theta_m-\theta)^2<0.01$$

    Specify the information for this goal condition.

    goalConditionInfo2 = {1,7,1,4};

    Combine the goal condition information into a cell array.

    goalConditionInfo = {goalConditionInfo1, goalConditionInfo2};

    Define an is-done function that returns true when the next observation satisfies both goal conditions.

    function isdone = hindsightIsDoneFcn3(Observation,Action,NextObservation) 
        NextObservation = NextObservation{1};
        xm = NextObservation(2);
        ym = NextObservation(3);
        xg = NextObservation(5);
        yg = NextObservation(6);
        thetam = NextObservation(7);
        theta = NextObservation(4);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1 ...
          && (thetam-theta)^2 < 0.01;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and -0.01 otherwise.

    function reward = hindsightRewardFcn3(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn3(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = -0.01;
        end
    end
    

    Create a hindsight replay memory buffer with a default maximum length.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn3,@hindsightIsDoneFcn3, ...
        goalConditionInfo);

    When you use a hindsight replay memory buffer within your custom agent training loop, you generate hindsight experiences at the end of each training episode.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1],...
        LowerLimit=0,UpperLimit=[1;5;5;5;5;1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1],...
        LowerLimit=0,UpperLimit=10);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    For this example, use hindsightRewardFcn1 as the ground-truth reward function and hindsightIsDoneFcn1 as the termination condition function.

    Create the hindsight replay memory buffer.

    buffer = rlHindsightReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    As you train your agent, you add experience trajectories to the experience buffer. For this example, add a random experience trajectory of length 10.

    for i = 1:10
        exp(i).Observation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Action = {actInfo.UpperLimit.*rand(1)};
        exp(i).NextObservation = {obsInfo.UpperLimit.*rand(6,1)};
        exp(i).Reward = 10*rand(1);
        exp(i).IsDone = 0;
    end
    exp(10).IsDone = 1;
    
    append(buffer,exp);

    At the end of the training episode, you generate hindsight experiences from the last trajectory added to the buffer. Generate experiences specifying the length of the last trajectory added to the buffer.

    newExp = generateHindsightExperiences(buffer,10);

    For each experience in the final trajectory, the default "final" sampling strategy generates a new experience where it replaces the goals in Observation and NextObservation with the goal measurements from the final experience in the trajectory.

    To validate this behavior, first view the final goal measurements from exp.

    exp(10).NextObservation{1}(2:3)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    Next, view the goal values for one of the generated experiences. This value should match the final goal measurement.

    newExp(6).Observation{1}(4:5)
    ans = 2×1
    
        0.7277
        0.6803
    
    

    After generating the new experiences, append them to the buffer.

    append(buffer,newExp);
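
    The relabeling that the "final" strategy performs in this example can be approximated by hand for the single-channel signal layout. The following is a simplified sketch in base MATLAB, not the toolbox implementation. The names traj, relabeled, measIdx, goalIdx, and newGoal are hypothetical, and the sketch assumes that the reward and is-done values are recomputed from the relabeled next observation using the same rules as hindsightRewardFcn1 and hindsightIsDoneFcn1.

```matlab
rng(0);                         % reproducible random trajectory
T = 5;                          % trajectory length
measIdx = [2 3];                % goal measurement elements
goalIdx = [4 5];                % goal elements

% Build a random trajectory of observations and next observations.
for k = 1:T
    traj(k).Observation = {rand(6,1)};
    traj(k).NextObservation = {rand(6,1)};
end

% "final" strategy: the new goal is the last achieved goal measurement.
newGoal = traj(T).NextObservation{1}(measIdx);

relabeled = traj;
for k = 1:T
    % Substitute the new goal into both observations.
    relabeled(k).Observation{1}(goalIdx) = newGoal;
    relabeled(k).NextObservation{1}(goalIdx) = newGoal;
    nextObs = relabeled(k).NextObservation{1};
    % Recompute the outcome under the substituted goal.
    relabeled(k).IsDone = norm(nextObs(measIdx) - nextObs(goalIdx)) < 0.1;
    relabeled(k).Reward = 1*relabeled(k).IsDone - 0.01*(~relabeled(k).IsDone);
end
```

    Because the substituted goal equals the final goal measurement, the last relabeled experience always satisfies the goal condition, which is how hindsight relabeling produces successful examples even when the original trajectory never reached its goal.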

    Limitations

    • Hindsight experience replay does not support agents that use recurrent neural networks.

    Algorithms

    References

    [1] Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 'Hindsight experience replay'. 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA: 2017.

    Version History

    Introduced in R2023a