rlHindsightPrioritizedReplayMemory

Hindsight replay memory experience buffer with prioritized sampling

Since R2023a

    Description

    An off-policy reinforcement learning agent stores experiences in a circular experience buffer.

    During training the agent stores each of its experiences (S,A,R,S',D) in the buffer. Here:

    • S is the current observation of the environment.

    • A is the action taken by the agent.

    • R is the reward for taking action A.

    • S' is the next observation after taking action A.

    • D is the is-done signal after taking action A.

    The agent then samples mini-batches of experiences from the buffer and uses these mini-batches to update its actor and critic function approximators.

    By default, built-in off-policy agents (DQN, DDPG, TD3, SAC, MBPO) use an rlReplayMemory object as their experience buffer. For goal-conditioned tasks, where the observation includes both the goal and a goal measurement, you can use an rlHindsightReplayMemory object.

    rlHindsightReplayMemory objects uniformly sample experiences from the buffer. To use prioritized nonuniform sampling, which can improve sample efficiency, use an rlHindsightPrioritizedReplayMemory object.

    A hindsight replay memory experience buffer:

    • Generates additional experiences by replacing goals with goal measurements

    • Improves sample efficiency for tasks with sparse rewards

    • Requires a ground-truth reward function and is-done function

    • Is not necessary when you have a well-shaped reward function

    For more information on hindsight experience replay and prioritized sampling, see Algorithms.

    Creation

    Description

    buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo) creates a hindsight prioritized replay memory experience buffer that is compatible with the observation and action specifications in obsInfo and actInfo, respectively. This syntax sets the RewardFcn, IsDoneFcn, and GoalConditionInfo properties.

    buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo,rewardFcn,isDoneFcn,goalConditionInfo,maxLength) sets the maximum length of the buffer by setting the MaxLength property.

    Input Arguments

    Observation specifications, specified as a reinforcement learning specification object or an array of specification objects defining properties such as dimensions, data types, and names of the observation signals.

    You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    Action specifications, specified as a reinforcement learning specification object defining properties such as dimensions, data types, and names of the action signals.

    You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
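
    For instance, the following sketch assumes env is a reinforcement learning environment object you have already created; it extracts both specifications from the environment.

    % env is assumed to be an existing reinforcement learning environment.
    obsInfo = getObservationInfo(env);
    actInfo = getActionInfo(env);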

    Properties

    This property is read-only.

    Maximum buffer length, specified as a nonnegative integer.

    To change the maximum buffer length, use the resize function.
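
    For example, assuming buffer is an existing replay memory object, the following sketch grows the buffer to hold one million experiences.

    resize(buffer,1e6);   % increase MaxLength to 1e6 experiences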

    This property is read-only.

    Number of experiences in buffer, specified as a nonnegative integer.

    Goal condition information, specified as a 1-by-N cell array, where N is the number of goal conditions. For the ith goal condition, the corresponding cell of GoalConditionInfo contains a 1-by-4 cell array with the following elements.

    • GoalConditionInfo{i}{1} — Goal measurement channel index.

    • GoalConditionInfo{i}{2} — Goal measurement element indices.

    • GoalConditionInfo{i}{3} — Goal channel index.

    • GoalConditionInfo{i}{4} — Goal element indices.

    The goal measurements in GoalConditionInfo{i}{2} correspond to the goals in GoalConditionInfo{i}{4}.

    As an example, suppose that obsInfo contains specifications for two observation channels. Further, suppose that there is one goal condition where the goal measurements correspond to elements 2 and 3 of the first observation channel, and the goals correspond to elements 4 and 5 of the second observation channel. In this case, the goal condition information is:

    GoalConditionInfo = {{1,[2 3],2,[4 5]}};

    Reward function, specified as a handle to a function with the following signature. For an illustrative sketch, see the example after the argument descriptions.

    function reward = myRewardFcn(obs,action,nextObs)

    Here:

    • reward is a scalar reward value.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.
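
    As an illustrative sketch only, a reward function for the two-channel layout described in the GoalConditionInfo property (goal measurements in elements 2 and 3 of the first channel, goals in elements 4 and 5 of the second channel) could look like the following; the function name, threshold, and reward values are arbitrary assumptions.

    function reward = myTwoChannelRewardFcn(obs,action,nextObs)
        % obs and nextObs are cell arrays with one element per observation channel.
        measurement = nextObs{1}(2:3);   % goal measurement: elements 2 and 3 of channel 1
        goal = nextObs{2}(4:5);          % goal: elements 4 and 5 of channel 2
        % Sparse reward: 1 when the measurement reaches the goal, -0.01 otherwise.
        if norm(measurement - goal) < 0.1
            reward = 1;
        else
            reward = -0.01;
        end
    end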

    Is-done function, specified as a handle to a function with the following signature. For an illustrative sketch, see the example after the argument descriptions.

    function isdone = myIsDoneFcn(obs,action,nextObs)

    Here:

    • isdone is true when the next observation is a terminal state and false otherwise.

    • obs is the current observation.

    • action is the action taken from the current observation.

    • nextObs is the next observation after taking the specified action.
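
    A matching is-done function for the same two-channel layout might look like the following sketch; the threshold is an arbitrary assumption.

    function isdone = myTwoChannelIsDoneFcn(obs,action,nextObs)
        % Terminal when the goal measurement (channel 1) is close to the goal (channel 2).
        measurement = nextObs{1}(2:3);
        goal = nextObs{2}(4:5);
        isdone = norm(measurement - goal) < 0.1;
    end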

    Goal measurement sampling strategy, specified as one of the following values.

    • "final" — Use the goal measurement from the end of the trajectory.

    • "episode" — Randomly sample M goal measurements from the trajectory, where M is equal to NumGoalSamples.

    • "future" — Randomly sample M goal measurements from the trajectory, but create hindsight experiences for measurements that were observed at time t+1 or later.

    Number of goal measurements to sample when generating experiences, specified as a positive integer. This parameter is ignored when Strategy is "final".
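
    For example, assuming buffer is an existing hindsight replay memory object, you can select the strategy and sample count by dot assignment, as in the configuration example later on this page.

    buffer.Strategy = "future";     % use only measurements observed after the current experience
    buffer.NumGoalSamples = 4;      % number of goal measurements to sample when generating experiences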

    Priority exponent to control the impact of prioritization during probability computation, specified as a nonnegative scalar less than or equal to 1.

    If the priority exponent is zero, the agent uses uniform sampling.
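
    For instance, the following sketch (assuming buffer is an existing rlHindsightPrioritizedReplayMemory object) reverts to uniform sampling.

    buffer.PriorityExponent = 0;   % zero exponent disables prioritization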

    Initial value of the importance sampling exponent, specified as a nonnegative scalar less than or equal to 1.

    Number of annealing steps for updating the importance sampling exponent, specified as a positive integer.

    This property is read-only.

    Current value of the importance sampling exponent, specified as a nonnegative scalar less than or equal to 1.

    During training, ImportanceSamplingExponent is linearly increased from InitialImportanceSamplingExponent to 1 over NumAnnealingSteps steps.
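
    As a sketch of the schedule described above (the buffer performs this update internally during training; the variable names and values are arbitrary):

    beta0 = 0.5;    % InitialImportanceSamplingExponent
    N = 1e4;        % NumAnnealingSteps
    k = 2500;       % annealing steps completed so far
    % Linear increase from beta0 toward 1, clipped at 1 after N steps.
    beta = min(1, beta0 + (1 - beta0)*k/N);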

    Object Functions

    append - Append experiences to replay memory buffer
    sample - Sample experiences from replay memory buffer
    resize - Resize replay memory experience buffer
    reset - Reset environment, agent, experience buffer, or policy object
    allExperiences - Return all experiences in replay memory buffer
    validateExperience - Validate experiences for replay memory
    generateHindsightExperiences - Generate hindsight experiences from hindsight experience replay buffer
    getActionInfo - Obtain action data specifications from reinforcement learning environment, agent, or experience buffer
    getObservationInfo - Obtain observation data specifications from reinforcement learning environment, agent, or experience buffer

    Examples

    For this example, create observation and action specifications directly. You can also extract such specifications from your environment.

    Create an observation specification for an environment with a single observation channel with six observations. For this example, assume that the observation channel contains the signals [a, xm, ym, xg, yg, c], where:

    • xg and yg are the goal observations.

    • xm and ym are the goal measurements.

    • a and c are additional observations.

    obsInfo = rlNumericSpec([6 1]);

    Create a specification for a single action.

    actInfo = rlNumericSpec([1 1]);

    Create a DDPG agent from the environment specifications. By default, the agent uses a replay memory experience buffer with uniform sampling.

    agent = rlDDPGAgent(obsInfo,actInfo);

    To create a hindsight replay memory buffer, first define the goal condition information. Both the goals and goal measurements are in the single observation channel. The goal measurements are in elements 2 and 3 of the observation channel and the goals are in elements 4 and 5 of the observation channel.

    goalConditionInfo = {{1,[2 3],1,[4 5]}};

    Define an is-done function. For this example, the is-done signal is true when the next observation satisfies the goal condition sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1.

    function isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation)
        NextObservation = NextObservation{1};
        xm = NextObservation(2);
        ym = NextObservation(3);
        xg = NextObservation(4);
        yg = NextObservation(5);
        isdone = sqrt((xm-xg)^2 + (ym-yg)^2) < 0.1;
    end
    

    Define a reward function. For this example, the reward is 1 when the is-done signal is true and –0.01 otherwise.

    function reward = hindsightRewardFcn1(Observation,Action,NextObservation)
        isdone = hindsightIsDoneFcn1(Observation,Action,NextObservation);
        if isdone
            reward = 1;
        else
            reward = -0.01;
        end
    end
    

    Create a hindsight prioritized replay memory buffer with a default maximum length.

    buffer = rlHindsightPrioritizedReplayMemory(obsInfo,actInfo, ...
        @hindsightRewardFcn1,@hindsightIsDoneFcn1,goalConditionInfo);

    Configure the prioritized replay memory options. For example, set the priority exponent to 0.5, the initial importance sampling exponent to 0.5, and the number of annealing steps for updating the exponent during training to 1e4.

    buffer.NumAnnealingSteps = 1e4;
    buffer.PriorityExponent = 0.5;
    buffer.InitialImportanceSamplingExponent = 0.5;

    Replace the default experience buffer with the hindsight replay memory buffer.

    agent.ExperienceBuffer = buffer;
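
    To check the buffer interactively, you can append a single experience and then draw a prioritized sample. The following is a sketch only; the experience values are arbitrary, and it assumes the struct-based experience format used by replay memory objects.

    % Append one experience with fields matching the observation and action specifications.
    experience.Observation = {zeros(6,1)};
    experience.Action = {0.5};
    experience.Reward = -0.01;
    experience.NextObservation = {rand(6,1)};
    experience.IsDone = 0;
    append(buffer,experience);

    % Sample a mini-batch containing one experience using prioritized sampling.
    miniBatch = sample(buffer,1);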

    Limitations

    • Hindsight prioritized experience replay does not support agents that use recurrent neural networks.

    Algorithms

    References

    [1] Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. 'Prioritized experience replay'. arXiv:1511.05952 [Cs] 25 February 2016. https://arxiv.org/abs/1511.05952.

    [2] Andrychowicz, Marcin, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, Pieter Abbeel, and Wojciech Zaremba. 'Hindsight experience replay'. 31st Conference on Neural Information Processing Systems. Long Beach, CA, USA: 2017.

    Version History

    Introduced in R2023a