rlVectorQValueFunction

Vector Q-value function approximator for reinforcement learning agents

Since R2022a

Description

This object implements a vector Q-value function approximator that you can use as a critic with a discrete action space for a reinforcement learning agent. A vector Q-value function (also known as a vector action-value function) is a mapping from an environment observation to a vector in which each element represents the expected discounted cumulative long-term reward when an agent starts from the state corresponding to the given observation, executes the action corresponding to the element number, and follows a given policy afterwards. A vector Q-value function critic therefore needs only the observation as input. After you create an rlVectorQValueFunction critic, use it to create an agent such as rlQAgent, rlDQNAgent, or rlSARSAAgent. For more information on creating actors and critics, see Create Policies and Value Functions.
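
As a minimal sketch of how the output vector is interpreted, the following base MATLAB snippet picks the greedy action from a Q-value vector. The action set and the numeric values are assumptions made up for illustration and do not come from a trained critic.

```matlab
% Assumed discrete action set and an illustrative critic output
% (one element per action, in the same order as the action set).
actionSet = [7 5 3];
q = [0.2; -0.1; 0.7];

% The greedy action is the one whose element has the largest value.
[maxQ,idx] = max(q);
greedyAction = actionSet(idx);
```

Here idx is the position of the largest element in q, and greedyAction is the corresponding entry of the action set.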

Creation

Description

critic = rlVectorQValueFunction(net,observationInfo,actionInfo) creates the multi-output Q-value function critic with a discrete action space. Here, net is the deep neural network used as an approximation model, and must have only the observations as input and a single output layer having as many elements as the number of possible discrete actions. The network input layers are automatically associated with the environment observation channels according to the dimension specifications in observationInfo. This function sets the ObservationInfo and ActionInfo properties of critic to the observationInfo and actionInfo input arguments, respectively.

critic = rlVectorQValueFunction({basisFcn,W0},observationInfo,actionInfo) creates the multi-output Q-value function critic with a discrete action space using a custom basis function as underlying approximation model. The first input argument is a two-element cell array whose first element is the handle basisFcn to a custom basis function and whose second element is the initial weight matrix W0. Here the basis function must have only the observations as inputs, and W0 must have as many columns as the number of possible actions. The function sets the ObservationInfo and ActionInfo properties of critic to the input arguments observationInfo and actionInfo, respectively.

critic = rlVectorQValueFunction(___,Name=Value) specifies names of the observation input layers (for network-based approximators) or sets the UseDevice property using one or more name-value arguments. Specifying the input layer names allows you to explicitly associate the layers of your network approximator with specific environment channels. For all types of approximators, you can specify the device where computations for critic are executed, for example UseDevice="gpu".

Input Arguments

net

Deep neural network used as the underlying approximator within the critic, specified as a dlnetwork object (preferred) or another Deep Learning Toolbox neural network object that can be converted to one.

Note

Among the different network representation options, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other representations to dlnetwork explicitly before using it to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any Deep Learning Toolbox™ neural network object. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice allows a greater level of insight and control for cases in which the conversion is not straightforward and might require additional specifications.

The network must have as many input layers as the number of environment observation channels (with each input layer receiving input from an observation channel), and a single output layer with as many elements as the number of possible discrete actions.

rlVectorQValueFunction objects support recurrent deep neural networks.

The learnable parameters of the critic are the weights of the deep neural network. For a list of deep neural network layers, see List of Deep Learning Layers. For more information on creating deep neural networks for reinforcement learning, see Create Policies and Value Functions.

basisFcn

Custom basis function, specified as a function handle to a user-defined MATLAB function. The user-defined function can either be an anonymous function or a function on the MATLAB path. The output of the critic is the vector c = W'*B, where W is a matrix containing the learnable parameters and B is the column vector returned by the custom basis function. Each element of c approximates the value of executing the corresponding action from the observed state.

Your basis function must have the following signature.

B = myBasisFunction(obs1,obs2,...,obsN)

Here, obs1 to obsN are inputs in the same order and with the same data type and dimensions as the channels defined in observationInfo.

Example: @(obs1,obs2) [obs1(2)*obs1(1)^2; abs(obs2(5))]

W0

Initial value of the basis function weights W, specified as a matrix having as many rows as the length of the basis function output vector and as many columns as the number of possible actions.
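
The relation c = W'*B and the required size of W0 can be sketched in base MATLAB as follows. The basis function, the weight values, and the choice of two actions are assumptions chosen only to make the arithmetic easy to follow.

```matlab
% Hypothetical basis function of a single two-element observation channel.
basisFcn = @(obs) [obs(1); obs(2); obs(1)*obs(2)];

% Initial weights: 3 rows (length of the basis output) by
% 2 columns (assumed number of possible actions).
W0 = [1 0; 0 1; 2 -1];

obs = [2; 3];
B = basisFcn(obs);   % 3-by-1 column vector [2; 3; 6]
c = W0'*B;           % 2-by-1 vector with one value per action
```

With these values, c has one element per column of W0, which is why W0 needs as many columns as there are actions.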

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: UseDevice="gpu"

ObservationInputNames

Network input layer names corresponding to the environment observation channels, specified as a string array or a cell array of strings or character vectors. The function assigns, in sequential order, each environment observation channel specified in observationInfo to each layer whose name is specified in the array assigned to this argument. Therefore, the specified network input layers, ordered as indicated in this argument, must have the same data type and dimensions as the observation channels, as ordered in observationInfo.

This name-value argument is supported only when the approximation model is a deep neural network.

Example: ObservationInputNames={"obsInLyr1_airspeed","obsInLyr2_altitude"}
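
The following is a minimal sketch of associating two observation channels with two named input layers. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox are installed. The channel sizes, the action set, and the layer names obsInLyr1, obsInLyr2, cat, and q are hypothetical choices for illustration.

```matlab
% Two observation channels (assumed sizes) and three discrete actions.
obsInfo = [rlNumericSpec([4 1]) rlNumericSpec([2 1])];
actInfo = rlFiniteSetSpec([-1 0 1]);

% Main path: first input layer, concatenation, and output layer.
lg = layerGraph([
    featureInputLayer(4,Name="obsInLyr1")
    concatenationLayer(1,2,Name="cat")
    fullyConnectedLayer(numel(actInfo.Elements),Name="q")
    ]);

% Second input layer, connected to the second concatenation input.
lg = addLayers(lg,featureInputLayer(2,Name="obsInLyr2"));
lg = connectLayers(lg,"obsInLyr2","cat/in2");
net = dlnetwork(lg);

% The order of the names matches the order of the channels in obsInfo.
critic = rlVectorQValueFunction(net,obsInfo,actInfo, ...
    ObservationInputNames=["obsInLyr1","obsInLyr2"]);

% Evaluate the critic for one random observation per channel.
v = getValue(critic,{rand(4,1),rand(2,1)});
```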

Properties

ObservationInfo

Observation specifications, specified as an rlFiniteSetSpec or rlNumericSpec object or an array containing a mix of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

When you create the approximator object, the constructor function sets the ObservationInfo property to the input argument observationInfo.

You can extract observationInfo from an existing environment, function approximator, or agent using getObservationInfo. You can also construct the specifications manually using rlFiniteSetSpec or rlNumericSpec.

Example: [rlNumericSpec([2 1]) rlFiniteSetSpec([3,5,7])]

ActionInfo

Action specifications, specified as an rlFiniteSetSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

Note

Only one action channel is allowed.

When you create the approximator object, the constructor function sets the ActionInfo property to the input argument actionInfo.

You can extract actionInfo from an existing environment, approximator object, or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec.

Example: rlFiniteSetSpec([7 5 3])

Normalization

Normalization method, returned as an array in which each element (one for each input channel defined in the observationInfo property) is one of the following values:

  • "none" — Do not normalize the input of the function approximator object.

  • "rescale-zero-one" — Normalize the input by rescaling it to the interval between 0 and 1. The normalized input Y is (U - LowerLimit)./(UpperLimit - LowerLimit), where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than 0. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.

  • "rescale-symmetric" — Normalize the input by rescaling it to the interval between –1 and 1. The normalized input Y is 2*(U - LowerLimit)./(UpperLimit - LowerLimit) - 1, where U is the nonnormalized input. Note that nonnormalized input values lower than LowerLimit result in normalized values lower than –1. Similarly, nonnormalized input values higher than UpperLimit result in normalized values higher than 1. Here, UpperLimit and LowerLimit are the corresponding properties defined in the specification object of the input channel.
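
As a small worked example in base MATLAB, the two rescaling methods can be sketched as below, assuming the formulas Y = (U - LowerLimit)./(UpperLimit - LowerLimit) and Y = 2*(U - LowerLimit)./(UpperLimit - LowerLimit) - 1. The limits and input values are made up for illustration.

```matlab
% Assumed channel limits and a few nonnormalized input values.
LowerLimit = -2;
UpperLimit = 6;
U = [-2 2 6 8];   % the last value lies above UpperLimit

% "rescale-zero-one": maps [LowerLimit, UpperLimit] onto [0, 1].
Y01 = (U - LowerLimit)./(UpperLimit - LowerLimit);

% "rescale-symmetric": maps [LowerLimit, UpperLimit] onto [-1, 1].
Ysym = 2*(U - LowerLimit)./(UpperLimit - LowerLimit) - 1;
```

With these numbers, Y01 is [0 0.5 1 1.25] and Ysym is [-1 0 1 1.5], so the out-of-range input 8 is mapped outside the target interval rather than clipped.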

Note

When you specify the Normalization property of rlAgentInitializationOptions, normalization is applied only to the approximator input channels corresponding to rlNumericSpec specification objects in which both the UpperLimit and LowerLimit properties are defined. After you create the agent, you can use setNormalizer to assign normalizers that use any normalization method. For more information on normalizer objects, see rlNormalizer.

Example: "rescale-symmetric"

UseDevice

Computation device used to perform operations such as gradient computation, parameter update, and prediction during training and simulation, specified as either "cpu" or "gpu".

The "gpu" option requires both Parallel Computing Toolbox™ software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs, see GPU Computing Requirements (Parallel Computing Toolbox).

You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB®.
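
One possible way to choose the device programmatically is to test for a usable GPU first. This sketch relies on the MATLAB function canUseGPU. The variable name device is a hypothetical choice.

```matlab
% Fall back to the CPU when no supported GPU is available.
if canUseGPU
    device = "gpu";
else
    device = "cpu";
end

% The result can then be passed when creating the critic, for example:
% critic = rlVectorQValueFunction(net,obsInfo,actInfo,UseDevice=device);
```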

Note

Training or simulating an agent on a GPU involves device-specific numerical round-off errors. These errors can produce different results compared to performing the same operations using a CPU.

To speed up training by using parallel processing over multiple cores, you do not need to use this argument. Instead, when training your agent, use an rlTrainingOptions object in which the UseParallel option is set to true. For more information about training using multicore processors and GPUs, see Train Agents Using Parallel Computing and GPUs.
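
A minimal sketch of the parallel training setting, assuming Reinforcement Learning Toolbox is installed (actually running parallel training also requires Parallel Computing Toolbox). The variable name trainOpts is a hypothetical choice.

```matlab
% Training options with parallel processing enabled.
% This setting is independent of the UseDevice property of the critic.
trainOpts = rlTrainingOptions(UseParallel=true);
```

You would then pass trainOpts to train together with your agent and environment.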

Example: "gpu"

Learnables

Learnable parameters of the approximation object, specified as a cell array of dlarray objects. This property contains the learnable parameters of the approximation model used by the approximator object.

Example: {dlarray(rand(256,4)),dlarray(rand(256,1))}
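
The following sketch reads and writes the learnable parameters through getLearnableParameters and setLearnableParameters. It assumes Reinforcement Learning Toolbox is installed. The specifications, the basis function, and the scaling factor are hypothetical and serve only to show the round trip.

```matlab
% Small basis-function critic used only for illustration.
obsInfo = rlNumericSpec([2 1]);
actInfo = rlFiniteSetSpec([1 2 3]);
basisFcn = @(obs) [obs(1,1,:); obs(2,1,:).^2];
critic = rlVectorQValueFunction({basisFcn,rand(2,3)},obsInfo,actInfo);

% Read the parameters as a cell array, scale them, and write them back.
params = getLearnableParameters(critic);
params = cellfun(@(p) 0.5*p,params,UniformOutput=false);
critic = setLearnableParameters(critic,params);
```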

State

State of the approximation object, specified as a cell array of dlarray objects. For dlnetwork-based models, this property contains the Value column of the State property table of the dlnetwork model. The elements of the cell array are the state of the recurrent neural network used in the approximator (if any), as well as the state for the batch normalization layer (if used).

For model types that are not based on a dlnetwork object, this property is an empty cell array, since these model types do not support states.

Example: {dlarray(rand(256,1)),dlarray(rand(256,1))}

Object Functions

rlDQNAgent - Deep Q-network (DQN) reinforcement learning agent
rlQAgent - Q-learning reinforcement learning agent
rlSARSAAgent - SARSA reinforcement learning agent
getValue - Obtain estimated value from a critic given environment observations and actions
getMaxQValue - Obtain maximum estimated value over all possible actions from a Q-value function critic with discrete action space, given environment observations
evaluate - Evaluate function approximator object given observation (or observation-action) input data
getLearnableParameters - Obtain learnable parameter values from agent, function approximator, or policy object
setLearnableParameters - Set learnable parameter values of agent, function approximator, or policy object
setModel - Set approximation model in function approximator object
getModel - Get approximation model from function approximator object
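
As a hedged sketch of how some of these functions fit together, the snippet below evaluates a small critic with getValue, getMaxQValue, and evaluate. It assumes Reinforcement Learning Toolbox is installed. The specifications and the basis function are hypothetical.

```matlab
% Small basis-function critic used only for illustration.
obsInfo = rlNumericSpec([2 1]);
actInfo = rlFiniteSetSpec([1 2 3]);
basisFcn = @(obs) [obs(1,1,:); obs(2,1,:).^2];
critic = rlVectorQValueFunction({basisFcn,rand(2,3)},obsInfo,actInfo);

obs = {rand(obsInfo.Dimension)};

% Values of all actions for the given observation.
q = getValue(critic,obs);

% Largest value and the index of the corresponding action.
[maxQ,maxIdx] = getMaxQValue(critic,obs);

% Generic evaluation; the output is returned as a cell array.
out = evaluate(critic,obs);
```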

Examples

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that there is a single observation channel that carries a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible actions (labeled 7, 5, and 3).

actInfo = rlFiniteSetSpec([7 5 3]);

A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy afterwards.

To model the parametrized vector Q-value function within the critic, use a neural network with one input layer (receiving the content of the observation channel, as specified by obsInfo) and one output layer (returning the vector of values for all the possible actions, as defined by actInfo).

Define the network as an array of layer objects, and get the dimension of the observation space and the number of possible actions from the environment specification objects.

net = [  
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements)) 
    ];

Convert the network to a dlnetwork object, and display the number of learnable parameters.

net = dlnetwork(net);
summary(net)
   Initialized: true

   Number of learnables: 403

   Inputs:
      1   'input'   4 features

Create the critic using the network, as well as the observation and action specification objects.

critic = rlVectorQValueFunction(net,obsInfo,actInfo)
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
      Normalization: "none"
          UseDevice: "cpu"
         Learnables: {6x1 cell}
              State: {0x1 cell}

To check your critic, use getValue to return the values of a random observation, using the current network weights. There is one value for each of the three possible actions.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3x1 single column vector

    0.0761
   -0.5906
    0.2072

You can now use the critic to create an agent for the environment described by the given specification objects. Examples of agents that can work with a discrete action space, a continuous observation space, and use a vector Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.
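
As a minimal sketch of that next step, the following code builds a DQN agent with default options from a critic like the one above and queries it for an action. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox are installed. The smaller network is an assumption made to keep the example short.

```matlab
% Specifications matching the example: 4-element observation, 3 actions.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([7 5 3]);

% Compact network with one output element per action.
net = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(16)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements))
    ]);
critic = rlVectorQValueFunction(net,obsInfo,actInfo);

% Create the agent from the critic and get an action for a
% random observation. The action is returned in a cell array.
agent = rlDQNAgent(critic);
act = getAction(agent,{rand(obsInfo.Dimension)});
```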

For more information on creating approximator objects such as actors and critics, see Create Policies and Value Functions.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as a continuous four-dimensional space, so that there is a single observation channel that carries a column vector containing four doubles.

obsInfo = rlNumericSpec([4 1]);

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible values (labeled 7, 5, and 3).

actInfo = rlFiniteSetSpec([7 5 3]);

A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy afterwards.

To model the parametrized vector Q-value function within the critic, use a neural network with one input layer (receiving the content of the observation channel, as specified by obsInfo) and one output layer (returning the vector of values for all the possible actions, as defined by actInfo).

Define the network as an array of layer objects, and get the dimension of the observation space and the number of possible actions from the environment specification objects. Name the network input netObsIn (so you can later explicitly associate it with the observation input channel).

net = [  
    featureInputLayer(obsInfo.Dimension(1),Name="netObsIn")
    fullyConnectedLayer(32)
    tanhLayer
    fullyConnectedLayer(numel(actInfo.Elements))
    ];

Convert the network to a dlnetwork object and display the number of its learnable parameters.

net = dlnetwork(net)
net = 
  dlnetwork with properties:

         Layers: [4x1 nnet.cnn.layer.Layer]
    Connections: [3x2 table]
     Learnables: [4x3 table]
          State: [0x3 table]
     InputNames: {'netObsIn'}
    OutputNames: {'fc_2'}
    Initialized: 1

  View summary with summary.

summary(net)
   Initialized: true

   Number of learnables: 259

   Inputs:
      1   'netObsIn'   4 features

Create the critic using the network, the observation and action specification objects, and the name of the network input layer. The specified network input layer, netObsIn, is associated with the environment observation, and therefore must have the same data type and dimension as the observation channel specified in obsInfo.

critic = rlVectorQValueFunction(net, ...
    obsInfo,actInfo, ...
    ObservationInputNames="netObsIn")
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
      Normalization: "none"
          UseDevice: "cpu"
         Learnables: {4x1 cell}
              State: {0x1 cell}

To check your critic, use getValue to return the values of a random observation, using the current network weights. The function returns one value for each of the three possible actions.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 3x1 single column vector

    0.0435
    0.1906
    0.7386

You can now use the critic to create an agent for the environment described by the given specification objects. Examples of agents that can work with a discrete action space, a continuous observation space, and use a vector Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.

For more information on creating approximator objects such as actors and critics, see Create Policies and Value Functions.

Create an observation specification object (or alternatively use getObservationInfo to extract the specification object from an environment). For this example, define the observation space as consisting of two channels, the first carrying a two-by-two continuous matrix and the second carrying a scalar that can assume only two values, 0 and 1.

obsInfo = [rlNumericSpec([2 2]) 
           rlFiniteSetSpec([0 1])];

Create a finite set action specification object (or alternatively use getActionInfo to extract the specification object from an environment with a discrete action space). For this example, define the action space as a finite set consisting of three possible vectors, [1 2], [3 4], and [5 6].

actInfo = rlFiniteSetSpec({[1 2],[3 4],[5 6]});

A vector Q-value function takes a batch of observations as input (note that there is no action input) and returns as output a corresponding batch of vectors each with as many elements as the number of possible actions.

To model the parametrized vector Q-value function within the critic, use a custom basis function with two inputs (which receive the content of the environment observation channels, as specified by obsInfo).

Create a function that returns a vector of four elements, given an observation as input. Here, the third dimension is the batch dimension. For each element of the batch dimension, the output of the basis function is a vector with four elements.

myBasisFcn = @(obsA,obsB) [
    obsA(1,1,:)+obsB(1,1,:).^2;
    obsA(2,1,:)-obsB(1,1,:).^2;
    obsA(1,2,:).^2+obsB(1,1,:);
    obsA(2,2,:).^2-obsB(1,1,:)
    ];

For each element of the batch, the output of the critic is the vector c = W'*myBasisFcn(obsA,obsB), where W is a weight matrix which must have as many rows as the length of the basis function output and as many columns as the number of possible actions.

Each element of c represents the expected cumulative long-term reward when an agent starts from the given observation and takes the action corresponding to the position of the considered element (and follows the policy afterwards). The elements of W are the learnable parameters.

Define an initial parameter matrix.

W0 = rand(4,3);

Create the critic. The first argument is a two-element cell containing both the handle to the custom function and the initial parameter matrix. The second and third arguments are, respectively, the observation and action specification objects.

critic = rlVectorQValueFunction({myBasisFcn,W0},obsInfo,actInfo)
critic = 
  rlVectorQValueFunction with properties:

    ObservationInfo: [2x1 rl.util.RLDataSpec]
         ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
      Normalization: ["none"    "none"]
          UseDevice: "cpu"
         Learnables: {[3x4 dlarray]}
              State: {}

To check your critic, use getValue to return the values of a random observation, using the current parameter matrix. The function returns one value for each of the three possible actions.

v = getValue(critic,{rand(2,2),0})
v = 3x1 single column vector

    1.9733
    1.1479
    2.2037

Note that the critic does not enforce the set constraint for the discrete set elements.

v = getValue(critic,{rand(2,2),-1})
v = 3x1 single column vector

    2.0251
    1.4035
    2.2437
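
Because the critic does not check set membership, you might want to validate observations yourself before evaluating it. The following base MATLAB sketch does this with ismember. The variable names and candidate values are hypothetical, and the allowed set mirrors the elements [0 1] of the second observation channel.

```matlab
% Allowed elements of the finite-set observation channel.
% With the toolbox, these would be available as obsInfo(2).Elements.
allowed = [0 1];

% Candidate observation values, some of which are not in the set.
candidates = [0 -1 1 0.5];

% Logical mask that is true only for values belonging to the set.
isValid = ismember(candidates,allowed);
```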

Obtain values for a random batch of 10 observations.

v = getValue(critic,{ ...
    rand([obsInfo(1).Dimension 10]), ...
    rand([obsInfo(2).Dimension 10]) ...
    });

Display the values corresponding to the seventh element of the observation batch.

v(:,7)
ans = 3x1 single column vector

    1.1097
    0.8339
    1.2711
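
To turn a batch of value vectors into greedy actions, you can take the maximum along the first dimension. This base MATLAB sketch uses a made-up 3-by-2 value matrix in place of the critic output and the three action vectors from this example.

```matlab
% Possible actions, in the order used by the critic output.
actions = {[1 2],[3 4],[5 6]};

% Illustrative values: rows are actions, columns are batch elements.
v = [0.1 0.9;
     0.5 0.2;
     0.3 0.4];

% Index of the best action for each batch element.
[~,idx] = max(v,[],1);

% Greedy action for each batch element, as a cell array.
greedy = actions(idx);
```

For these values, idx is [2 1], so the greedy actions are [3 4] for the first batch element and [1 2] for the second.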

You can now use the critic to create an agent for the environment described by the given specification objects. Examples of agents that can work with a discrete action space, a mixed observation space, and use a vector Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.

For more information on creating approximator objects such as actors and critics, see Create Policies and Value Functions.

Create an environment and obtain observation and action specification objects.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

A vector Q-value function takes only the observation as input and returns as output a single vector with as many elements as the number of possible actions. The value of each output element represents the expected discounted cumulative long-term reward for taking the action from the state corresponding to the current observation, and following the policy afterwards.

To model the parametrized vector Q-value function within the critic, use a recurrent neural network with one input layer (receiving the content of the observation channel, as specified by obsInfo) and one output layer (returning the vector of values for all the possible actions, as defined by actInfo).

Define the network as an array of layer objects, and get the dimension of the observation space and the number of possible actions from the environment specification objects. To create a recurrent network, use a sequenceInputLayer as the input layer (with size equal to the number of dimensions of the observation channel) and include at least one lstmLayer.

net = [
    sequenceInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(50)
    reluLayer
    lstmLayer(20)
    fullyConnectedLayer(20)
    reluLayer
    fullyConnectedLayer(numel(actInfo.Elements)) 
    ];

Convert the network to a dlnetwork object, and display the number of learnable parameters.

net = dlnetwork(net);
summary(net)
   Initialized: true

   Number of learnables: 6.3k

   Inputs:
      1   'sequenceinput'   Sequence input with 4 dimensions

Create the critic using the network, as well as the observation and action specification objects.

critic = rlVectorQValueFunction(net, ...
    obsInfo,actInfo);

To check your critic, use getValue to return the values of a random observation, using the current network weights. The function returns one value for each of the two possible actions.

v = getValue(critic,{rand(obsInfo.Dimension)})
v = 2×1 single column vector

    0.0092
   -0.0016

You can use dot notation to extract and set the current state of the recurrent neural network in the critic.

critic.State
ans=2×1 cell array
    {20×1 dlarray}
    {20×1 dlarray}

critic.State = { 
    -0.1*dlarray(rand(20,1))
     0.1*dlarray(rand(20,1))
     };

To evaluate the critic using sequential observations, use the sequence length (time) dimension. For example, obtain values for 5 independent sequences, each one consisting of 9 sequential observations.

[value,state] = getValue(critic, ...
    {rand([obsInfo.Dimension 5 9])});

Display the value of the first action corresponding to the seventh element of the observation sequence in the fourth sequence.

value(1,4,7)
ans = single
    0.0557

Display the updated state of the recurrent neural network.

state
state=2×1 cell array
    {20×5 single}
    {20×5 single}

You can now use the critic to create an agent for the environment described by the given specification objects. Examples of agents that can work with a discrete action space, a continuous observation space, and use a vector Q-value function critic, are rlQAgent, rlDQNAgent, and rlSARSAAgent.

For more information on creating approximator objects such as actors and critics, see Create Policies and Value Functions.

Version History

Introduced in R2022a