
rlHybridStochasticActor

Hybrid stochastic actor with a hybrid action space for reinforcement learning agents

Since R2024b

    Description

    This object implements a function approximator to be used as a stochastic actor within a reinforcement learning agent with a hybrid action space (partly discrete and partly continuous). A hybrid stochastic actor takes an environment observation as input and returns as output a random action containing a discrete and a continuous part. The discrete part is sampled from a categorical (also known as Multinoulli) probability distribution, and the continuous part is sampled from a parametrized Gaussian probability distribution. After you create an rlHybridStochasticActor object, use it to create an rlSACAgent agent with a hybrid action space. For more information on creating actors and critics, see Create Policies and Value Functions.

    Creation

    Description

    actor = rlHybridStochasticActor(net,observationInfo,actionInfo,DiscreteActionOutputNames=dsctOutLyrName,ContinuousActionMeanOutputNames=meanOutLyrName,ContinuousActionStandardDeviationOutputNames=stdOutLyrName) creates a hybrid stochastic actor with a hybrid action space using the deep neural network net as the approximation model. Here, net must have three differently named output layers. One layer must return the probability of each possible discrete action, and the other two layers must return the mean and the standard deviation of the Gaussian distribution of each component of the continuous action, respectively. The actor uses the output of these three layers, according to the names specified in the strings dsctOutLyrName, meanOutLyrName, and stdOutLyrName, to represent the probability distributions from which the discrete and continuous components of the action are sampled. This syntax sets the ObservationInfo and ActionInfo properties of actor to the input arguments observationInfo and actionInfo, respectively.

    Note

    actor does not enforce the constraints set by the continuous action specification. When you use this actor in an agent other than a SAC agent, you must enforce action space constraints within the environment.

    actor = rlHybridStochasticActor(___,Name=Value) specifies names of the observation input layers or sets the UseDevice property using one or more name-value arguments. Use this syntax with any of the input argument combinations in the preceding syntax. Specify the input layer names to explicitly associate the layers of your network with specific environment channels. To specify the device where computations for actor are executed, set the UseDevice property, for example UseDevice="gpu".


    Input Arguments


    Deep neural network used as the underlying approximation model within the actor. It must have as many input layers as the number of environment observation channels (with each input layer receiving input from an observation channel).

    The network must have three differently named output layers. One must return the probability of each possible discrete action, and the other two must return the mean and the standard deviation of the Gaussian distribution of each component of the continuous action, respectively. The discrete action probability layer must have the same number of outputs as the number of possible discrete actions, as specified in the first component of actionInfo. The continuous action mean and standard deviation layers must both have the same number of outputs as the number of dimensions of the continuous action channel, as specified in the second component of actionInfo.

    The actor uses the output of these three layers, according to the names specified in the strings dsctOutLyrName, meanOutLyrName and stdOutLyrName, to represent the probability distributions from which the discrete and continuous components of the action are sampled.

    Note

    Since standard deviations must be nonnegative, the output layer that returns the standard deviations must be a softplus or ReLU layer, to enforce nonnegativity. Also, unless the actor is used in a SAC agent, the mean values must fall within the range of the action. Therefore, when you use the actor in an agent other than SAC, to scale the mean values to the output range, use a scaling layer as the output layer for the mean values, preceded by a hyperbolic tangent layer. SAC agents automatically read the action range from the UpperLimit and LowerLimit properties of the action specification and then internally scale the distribution and bound the action. Therefore, if the actor is used in a SAC agent, do not add any layer that scales or bounds the mean values output. For more information, see Soft Actor-Critic (SAC) Agent.
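
    As an illustration, the following sketch shows possible output paths for the continuous part of a hypothetical three-dimensional action with range [-10, 10]. The layer names and sizes are illustrative, not required by the object.

    % Standard deviation output path: a softplus layer enforces nonnegativity.
    stdPath = [
        fullyConnectedLayer(3,Name="stdPathInLyr")
        softplusLayer(Name="stdActOutLyr")
        ];

    % Mean output path for agents other than SAC: a hyperbolic tangent layer
    % followed by a scaling layer maps the mean values into the range [-10,10].
    meanPathBounded = [
        fullyConnectedLayer(3,Name="meanPathInLyr")
        tanhLayer
        scalingLayer(Name="meanActOutLyr",Scale=10)
        ];

    % Mean output path for a SAC agent: no scaling or bounding layers.
    meanPathSAC = fullyConnectedLayer(3,Name="meanActOutLyr");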

    You can specify the network as a dlnetwork object or as a different neural network object from the Deep Learning Toolbox, which is converted internally to a dlnetwork object (see the following note).

    Note

    Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using them to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any neural network object from the Deep Learning Toolbox™. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.
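
    For instance, the following sketch converts a simple layer array (hypothetical sizes and layer names) to a dlnetwork object before using it to create an actor or a critic.

    % Define a simple layer array (hypothetical sizes and layer names).
    layers = [
        featureInputLayer(4,Name="inLyr")
        fullyConnectedLayer(8)
        reluLayer
        fullyConnectedLayer(2,Name="outLyr")
        ];

    % Convert explicitly to a dlnetwork object, which runs validation checks
    % and supports automatic differentiation.
    dlnet = dlnetwork(layers);
    summary(dlnet)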

    rlHybridStochasticActor objects support recurrent deep neural networks.

    The learnable parameters of the actor are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

    Name of the network output layer returning the probabilities of each possible discrete action, specified as a string or character vector. The actor uses this name to select the network output layer that returns the probabilities of each element of the discrete action channel. Therefore, within net, this layer must be named as indicated in dsctOutLyrName.

    Example: "probDiscreteActOutLyr"

    Name of the network output layer returning the mean values of the continuous action, specified as a string or character vector. The actor uses this name to select the network output layer that returns the mean values of each element of the action channel. Therefore, within net, this layer must be named as indicated in meanOutLyrName.

    Example: "meanContinuousActOutLyr"

    Name of the network output layer returning the standard deviations of the continuous action, specified as a string or character vector. The actor uses this name to select the network output layer that returns the standard deviations of each element of the action channel. Therefore, within net, this layer must be named as indicated in stdOutLyrName. To enforce nonnegativity of the returned standard deviations, this layer must be a softplus or ReLU layer.

    Example: "stdContinuousActOutLyr"

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: UseDevice="gpu"

    Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of strings or character vectors. The function assigns, in sequential order, each environment observation channel specified in observationInfo to each layer whose name is specified in the array assigned to this argument. Therefore, the specified net input layers, ordered as indicated in this argument, must have the same data type and dimensions as the observation channels, as ordered in observationInfo.

    Example: ObservationInputNames={"obsInLyr1_airspeed","obsInLyr2_altitude"}

    Properties


    Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

    When you create the approximator object, the constructor function sets the ObservationInfo property to the input argument observationInfo.

    You can extract observationInfo from an existing environment, function approximator, or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

    Example: [rlNumericSpec([2 1]) rlFiniteSetSpec([3,5,7])]

    Action specifications, specified as a vector consisting of one rlFiniteSetSpec object followed by one rlNumericSpec object. Each element defines the properties of an environment action channel, such as its dimensions, data type, and name.

    Note

    For hybrid action spaces, you must have two action channels: the first for the discrete part of the action and the second for the continuous part of the action.

    When you create the approximator object, the constructor function sets the ActionInfo property to the input argument actionInfo.

    You can extract ActionInfo from an existing environment or agent using getActionInfo. You can also construct the specifications manually using rlFiniteSetSpec and rlNumericSpec.

    Example: [rlFiniteSetSpec([-1 0 1]) rlNumericSpec([2 1])]

    Normalization method, returned as an array in which each element (one for each input channel defined in the observationInfo and actionInfo properties, in that order) is one of the following values:

    • "none" — Do not normalize the input.

    • "rescale-zero-one" — Normalize the input by rescaling it to the interval between 0 and 1. The normalized input Y is (UMin)./(UpperLimitLowerLimit), where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than 0. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.

    • "rescale-symmetric" — Normalize the input by rescaling it to the interval between –1 and 1. The normalized input Y is 2(ULowerLimit)./(UpperLimitLowerLimit) – 1, where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than –1. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.

    Note

    When you specify the Normalization property of rlAgentInitializationOptions, normalization is applied only to the approximator input channels corresponding to rlNumericSpec specification objects in which both the UpperLimit and LowerLimit properties are defined. After you create the agent, you can use setNormalizer to assign normalizers that use any normalization method. For more information on normalizer objects, see rlNormalizer.

    Example: "rescale-symmetric"

    Computation device used to perform operations such as gradient computation, parameter update and prediction during training and simulation, specified as either "cpu" or "gpu".

    The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs see GPU Computing Requirements (Parallel Computing Toolbox).

    You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.

    Note

    Training or simulating an agent on a GPU involves device-specific numerical round-off errors. Because of these errors, you can get different results on a GPU and on a CPU for the same operation.

    You do not need to use this property to speed up training by using parallel processing over multiple cores. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.

    Example: "gpu"

    Learnable parameters of the approximator object, specified as a cell array of dlarray objects. This property contains the learnable parameters of the approximation model used by the approximator object.

    Example: {dlarray(rand(256,4)),dlarray(rand(256,1))}
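
    For example, assuming an actor created as in the Examples section, the following sketch reads the learnable parameters as a cell array of dlarray objects and then sets them back, using the object functions listed later on this page.

    % Hedged sketch: assumes `actor` already exists.
    params = getLearnableParameters(actor);
    actor  = setLearnableParameters(actor,params);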

    State of the approximator object, specified as a cell array of dlarray objects. For dlnetwork-based models, this property contains the Value column of the State property table of the dlnetwork model. The elements of the cell array are the state of the recurrent neural network used in the approximator (if any), as well as the state for the batch normalization layer (if used).

    For model types that are not based on a dlnetwork object, this property is an empty cell array, since these model types do not support states.

    Example: {dlarray(rand(256,1)),dlarray(rand(256,1))}

    Object Functions

    rlSACAgent – Soft actor-critic (SAC) reinforcement learning agent
    getAction – Obtain action from agent, actor, or policy object given environment observations
    evaluate – Evaluate function approximator object given observation (or observation-action) input data
    gradient – (Not recommended) Evaluate gradient of function approximator object given observation and action input data
    accelerate – (Not recommended) Option to accelerate computation of gradient for approximator object based on neural network
    getLearnableParameters – Obtain learnable parameter values from agent, function approximator, or policy object
    setLearnableParameters – Set learnable parameter values of agent, function approximator, or policy object
    setModel – Set approximation model in function approximator object
    getModel – Get approximation model from function approximator object

    Examples


    Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous five-dimensional space, so that there is a single observation channel that carries a column vector containing five doubles.

    obsInfo = rlNumericSpec([5 1]);

    Create a hybrid action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a hybrid action space). For this example, define a discrete action channel that carries a single scalar that can be -1, 0, or 1, and define a continuous action channel that carries a column vector containing three doubles, each between -10 and 10.

    actInfo = [ 
        rlFiniteSetSpec([-1,0,1]) 
        rlNumericSpec([3 1], ...
            LowerLimit=-10, ...
            UpperLimit=10)
        ];

    A hybrid stochastic actor implements a parameterized stochastic policy for a hybrid action space. This actor takes an observation as input and returns as output a hybrid action consisting of a discrete and a continuous part. The discrete part of the action is sampled from a categorical probability distribution, while the continuous part of the action is sampled from a Gaussian probability distribution.

    To approximate the parameters of these two probability distributions, you must use a neural network with three output layers. One of these three layers must return a vector containing the probability of each possible discrete action, another layer must return a vector containing the mean values for each dimension of the continuous action space, and the third layer must return a vector containing the standard deviations for each dimension of the continuous action space.

    Therefore, the discrete action probability layer must have the same number of outputs as the number of possible discrete actions, as specified in the first component of actInfo. The continuous action mean and standard deviation layers must each have the same number of outputs as the number of dimensions of the continuous action channel, as specified in the second component of actInfo.

    Note that standard deviations must be nonnegative, and mean values must fall within the range of the action. Therefore, the output layer that returns the standard deviations must be a softplus or ReLU layer, to enforce nonnegativity, while the output layer that returns the mean values must be a scaling layer immediately preceded by a tanhLayer, to scale the mean values to the output range. However, if you are going to use the actor within a SAC agent, do not add a tanhLayer and a scalingLayer as the last two layers in the mean output path. For more information, see Soft Actor-Critic (SAC) Agent.

    For this example, the environment has only one observation channel, and therefore the network has only one input layer. Note that prod(obsInfo.Dimension), prod(actInfo(1).Dimension), and prod(actInfo(2).Dimension) return the number of dimensions of the observation channel and of the two action channels, respectively, regardless of whether they are arranged as row vectors, column vectors, or matrices.

    Define each network path as an array of layer objects and assign names to the input and output layers of each path. These names allow you to connect the paths and then to explicitly associate the network input and output layers with the appropriate environment channel.

    % Common input path layers
    comPath = [ 
        featureInputLayer( ...
            prod(obsInfo.Dimension), ...
            Name="comInLyr");
        fullyConnectedLayer( ...
            prod(actInfo(1).Dimension)+prod(actInfo(2).Dimension));
        reluLayer(Name="comPathOutLyr")
        ];
    
    % Path layers for discrete probabilities 
    dsctPath = [ 
        fullyConnectedLayer( ...
            numel(actInfo(1).Elements), ...
            Name="dsctPathInLyr");
        softmaxLayer(Name="dsctActProbOutLyr")
        ];
    
    % Path layers for mean value 
    % Using scalingLayer to scale range from (-1,1) to (-10,10)
    meanPath = [ 
        fullyConnectedLayer( ...
            prod(actInfo(2).Dimension), ...
            Name="meanPathInLyr");
    
        % DO NOT USE following lines if actor is used for a SAC agent
        tanhLayer;
        scalingLayer(Name="meanActOutLyr",Scale=actInfo(2).UpperLimit)
    
        ];
    
    % Path layers for standard deviations
    % Using softplus layer to make them nonnegative
    sdevPath = [ 
        fullyConnectedLayer( ...
            prod(actInfo(2).Dimension), ...
            Name="stdPathInLyr");
        softplusLayer(Name="stdActOutLyr") 
        ];
    

    Assemble the dlnetwork object.

    net = dlnetwork();
    net = addLayers(net,comPath);
    net = addLayers(net,dsctPath);
    net = addLayers(net,meanPath);
    net = addLayers(net,sdevPath);

    Connect the layers.

    net = connectLayers(net,"comPathOutLyr","dsctPathInLyr/in");
    net = connectLayers(net,"comPathOutLyr","meanPathInLyr/in");
    net = connectLayers(net,"comPathOutLyr","stdPathInLyr/in");

    Plot the network.

    plot(net)

    The plot shows the layer graph of the network.

    Initialize the network and display the number of learnable parameters (weights).

    net = initialize(net);
    summary(net)
       Initialized: true
    
       Number of learnables: 69
    
       Inputs:
          1   'comInLyr'   5 features
    

    Create the actor with rlHybridStochasticActor, using the network, the observation and action specification objects, and the names of the network input and output layers.

    actor = rlHybridStochasticActor(net, obsInfo, actInfo, ...
       DiscreteActionOutputNames="dsctActProbOutLyr", ...
       ContinuousActionMeanOutputNames="meanActOutLyr", ...
       ContinuousActionStandardDeviationOutputNames="stdActOutLyr", ...
       ObservationInputNames="comInLyr");

    To check your actor, use getAction to return an action from a random observation vector, using the current network weights.

    act = getAction(actor,{rand(obsInfo.Dimension)});

    Display the discrete action.

    act{1}
    ans = 
    -1
    

    The discrete action is a random sample from the categorical distribution provided by the discrete action probability output of the neural network, as a function of the current observation.

    Display the continuous action.

    act{2}
    ans = 3x1 single column vector
    
        0.3066
        3.6323
       -2.8120
    
    

    Each of the three elements of the continuous action vector is a random sample from the Gaussian distribution, with mean and standard deviation provided by the mean and standard deviation outputs of the neural network, respectively, as a function of the current observation.

    To return the stochastic distributions of the action, given an observation, use evaluate.

    dist = evaluate(actor,{rand(obsInfo.Dimension)});

    Display the categorical distribution of the discrete action.

    dist{1}
    ans = 3x1 single column vector
    
        0.3794
        0.2871
        0.3335
    
    

    Display the vector of mean values of the continuous action.

    dist{2}
    ans = 3x1 single column vector
    
       -1.9218
       -0.2399
       -2.7846
    
    

    Display the vector of standard deviations of the continuous action.

    dist{3}
    ans = 3x1 single column vector
    
        0.3488
        0.4983
        0.7860
    
    
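    To illustrate how the action samples relate to these distribution parameters, the following sketch draws one sample of each action part from the values returned by evaluate. This is only a conceptual sketch of the sampling model, not necessarily the exact internal implementation of getAction.

    % Sample the discrete part from the categorical probabilities in dist{1}.
    idx = find(rand <= cumsum(dist{1}),1);
    dsctSample = actInfo(1).Elements(idx)

    % Sample the continuous part from the Gaussian distribution with mean
    % dist{2} and standard deviation dist{3}.
    contSample = dist{2} + dist{3}.*randn(size(dist{2}),"single")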

    You can now use the actor (along with a critic) to create an agent for the environment described by the given specification objects. The agent that uses a hybrid stochastic actor is rlSACAgent.

    For more information on creating approximator objects such as actors and critics, see Create Policies and Value Functions.

    Version History

    Introduced in R2024b