Deep Deterministic Policy Gradient (DDPG) Agent
The deep deterministic policy gradient (DDPG) algorithm is an off-policy actor-critic method for environments with a continuous action space. A DDPG agent learns a deterministic policy while also using a Q-value function critic to estimate the value of the optimal policy. It features a target actor and a target critic as well as an experience buffer. DDPG agents support offline training (training from saved data, without an environment). For more information on the different types of reinforcement learning agents, see Reinforcement Learning Agents.
In Reinforcement Learning Toolbox™, a deep deterministic policy gradient agent is implemented by an rlDDPGAgent object.
DDPG agents can be trained in environments with the following observation and action spaces.
Observation Space | Action Space |
---|---|
Continuous or discrete | Continuous |
DDPG agents use the following actor and critic.
Critic | Actor |
---|---|
Q-value function critic Q(S,A), which you create using rlQValueFunction | Deterministic policy actor π(S), which you create using rlContinuousDeterministicActor |
During training, a DDPG agent:
Updates the actor and critic learnable parameters at each time step during learning.
Stores past experiences using a circular experience buffer. The agent updates the actor and critic using a mini-batch of experiences randomly sampled from the buffer.
Perturbs the action chosen by the policy using a stochastic noise model at each training step.
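The buffer and exploration behavior listed above can be illustrated with a minimal plain-MATLAB sketch. This is not the toolbox implementation: the toy policy, the toy environment, the Gaussian noise (standing in for whichever noise model the agent is configured with), sampling without replacement, and all variable names are assumptions made for illustration.

```matlab
% Minimal sketch with assumed names; not toolbox internals.
rng(0);                               % reproducible random numbers
capacity = 5;                         % hypothetical buffer length
obsDim   = 2;                         % hypothetical observation size
buf.S     = zeros(obsDim,capacity);   % observations
buf.A     = zeros(1,capacity);        % actions
buf.R     = zeros(1,capacity);        % rewards
buf.Snext = zeros(obsDim,capacity);   % next observations
count = 0;                            % experiences written so far

policy   = @(S) tanh(sum(S));         % toy deterministic policy (assumption)
noiseStd = 0.1;                       % toy Gaussian noise level (assumption)

S = zeros(obsDim,1);
for k = 1:8
    A = policy(S) + noiseStd*randn;   % perturb the policy action with noise
    Snext = S + 0.1*A;                % toy environment transition
    R = -sum(Snext.^2);               % toy reward
    idx = mod(count,capacity) + 1;    % circular index overwrites the oldest entry
    buf.S(:,idx) = S;
    buf.A(idx) = A;
    buf.R(idx) = R;
    buf.Snext(:,idx) = Snext;
    count = count + 1;
    S = Snext;
end

numStored = min(count,capacity);      % entries currently held in the buffer
M = 3;                                % mini-batch size
batchIdx = randperm(numStored,M);     % random sample, here without replacement
batch.S = buf.S(:,batchIdx);
batch.A = buf.A(batchIdx);
batch.R = buf.R(batchIdx);
batch.Snext = buf.Snext(:,batchIdx);
```

After eight steps with a capacity of five, the three oldest experiences have been overwritten, which is the defining property of a circular buffer.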
Actor and Critic Functions
To estimate the policy and value function, a DDPG agent maintains four function approximators:
Actor π(S;θ) — The actor, with parameters θ, takes observation S and returns the corresponding action that maximizes the long-term reward.
Target actor πt(S;θt) — To improve the stability of the optimization, the agent periodically updates the target actor learnable parameters θt using the latest actor parameter values.
Critic Q(S,A;ϕ) — The critic, with parameters ϕ, takes observation S and action A as inputs and returns the corresponding expectation of the long-term reward.
Target critic Qt(S,A;ϕt) — To improve the stability of the optimization, the agent periodically updates the target critic learnable parameters ϕt using the latest critic parameter values.
Both Q(S,A;ϕ) and Qt(S,A;ϕt) have the same structure and parameterization, and both π(S;θ) and πt(S;θt) have the same structure and parameterization.
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
During training, the agent tunes the parameter values in θ. After training, the parameters remain at their tuned values and the trained actor function approximator is stored in π(S).
Agent Creation
You can create and train DDPG agents at the MATLAB® command line or using the Reinforcement Learning Designer app. For more information on creating agents using Reinforcement Learning Designer, see Create Agents Using Reinforcement Learning Designer.
At the command line, you can create a DDPG agent with a default actor and critic based on the observation and action specifications from the environment. To do so, perform the following steps.
1. Create observation specifications for your environment. If you already have an environment object, you can obtain these specifications using getObservationInfo.
2. Create action specifications for your environment. If you already have an environment object, you can obtain these specifications using getActionInfo.
3. If needed, specify the number of neurons in each learnable layer of the default network or whether to use an LSTM layer. To do so, create an agent initialization option object using rlAgentInitializationOptions.
4. If needed, specify agent options using an rlDDPGAgentOptions object.
5. Create the agent using an rlDDPGAgent object.
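As a hedged illustration of these steps, the following sketch creates a default DDPG agent. The specification sizes, limits, and option values are hypothetical, and the Name=Value syntax assumes MATLAB R2021a or later.

```matlab
% Hypothetical specifications. With an environment object env, you could use
% obsInfo = getObservationInfo(env) and actInfo = getActionInfo(env) instead.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],LowerLimit=-1,UpperLimit=1);

% Optional: 64 neurons in each hidden layer of the default networks.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);

% Optional: agent options (illustrative values only).
agentOpts = rlDDPGAgentOptions(MiniBatchSize=64,DiscountFactor=0.99);

% Create the agent with a default actor and critic.
agent = rlDDPGAgent(obsInfo,actInfo,initOpts,agentOpts);

% Query the untrained agent for an action given a random observation.
act = getAction(agent,{rand(obsInfo.Dimension)});
```

The initialization options and agent options are both optional, so the shortest form of the call passes only the observation and action specifications.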
Alternatively, you can create an actor and a critic and use these objects to create your agent. In this case, ensure that the input and output dimensions of the actor and critic match the corresponding action and observation specifications of the environment.
1. Create an actor using an rlContinuousDeterministicActor object.
2. Create a critic using an rlQValueFunction object.
3. Specify agent options using an rlDDPGAgentOptions object (alternatively, you can skip this step and then modify the agent options later using dot notation).
4. Create the agent using an rlDDPGAgent object.
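The following sketch shows one possible way to carry out these steps with small custom networks. The layer sizes, layer names, and specifications are hypothetical, and the network construction (layerGraph converted to dlnetwork) is only one of several ways to build the approximators; treat it as an illustration rather than a recommended architecture.

```matlab
obsInfo = rlNumericSpec([4 1]);                             % hypothetical
actInfo = rlNumericSpec([1 1],LowerLimit=-1,UpperLimit=1);  % hypothetical

% Actor: observation in, action out (tanh keeps the output in [-1, 1]).
actorNet = dlnetwork([
    featureInputLayer(prod(obsInfo.Dimension),Name="actorIn")
    fullyConnectedLayer(32,Name="actorFC1")
    reluLayer(Name="actorRelu")
    fullyConnectedLayer(prod(actInfo.Dimension),Name="actorFC2")
    tanhLayer(Name="actorOut")]);
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);

% Critic: observation and action in, scalar Q-value out.
obsPath = [
    featureInputLayer(prod(obsInfo.Dimension),Name="obsIn")
    fullyConnectedLayer(32,Name="obsFC")];
actPath = [
    featureInputLayer(prod(actInfo.Dimension),Name="actIn")
    fullyConnectedLayer(32,Name="actFC")];
commonPath = [
    concatenationLayer(1,2,Name="concat")
    reluLayer(Name="criticRelu")
    fullyConnectedLayer(1,Name="qValue")];
lg = layerGraph(obsPath);
lg = addLayers(lg,actPath);
lg = addLayers(lg,commonPath);
lg = connectLayers(lg,"obsFC","concat/in1");
lg = connectLayers(lg,"actFC","concat/in2");
critic = rlQValueFunction(dlnetwork(lg),obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");

% Create the agent, then adjust an option afterward using dot notation.
agent = rlDDPGAgent(actor,critic);
agent.AgentOptions.MiniBatchSize = 64;
```

The actor output size and the critic input sizes are derived from obsInfo and actInfo so that the dimensions stay consistent with the specifications.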
For more information on creating actors and critics for function approximation, see Create Policies and Value Functions.
Training Algorithm
DDPG agents use the following training algorithm, in which they update their actor and critic models at each time step. To configure the training algorithm, specify options using an rlDDPGAgentOptions object.
Initialize the critic Q(S,A;ϕ) with random parameter values ϕ, and initialize the target critic parameters ϕt with the same values: ϕt = ϕ.
Initialize the actor π(S;θ) with random parameter values θ, and initialize the target actor parameters θt with the same values: θt = θ.
For each training time step:
1. For the current observation S, select action A = π(S;θ) + N, where N is stochastic noise from the noise model. To configure the noise model, use the NoiseOptions option.
2. Execute action A. Observe the reward R and next observation S'.
3. Store the experience (S,A,R,S') in the experience buffer. To specify the size of the experience buffer, use the ExperienceBufferLength option in the agent rlDDPGAgentOptions object.
4. Sample a random mini-batch of M experiences (Si,Ai,Ri,S'i) from the experience buffer. To specify M, use the MiniBatchSize property of the rlDDPGAgentOptions object.
5. If S'i is a terminal state, set the value function target yi to Ri. Otherwise, set it to

   $$y_i = R_i + \gamma\, Q_t\big(S'_i,\ \pi_t(S'_i;\theta_t);\ \phi_t\big)$$

   The value function target is the sum of the experience reward Ri and the discounted future reward. To specify the discount factor γ, use the DiscountFactor option.

   To compute the cumulative reward, the agent first computes a next action by passing the next observation S'i from the sampled experience to the target actor. The agent finds the cumulative reward by passing the next action to the target critic.

   If you specify a value of NumStepsToLookAhead equal to N, then the N-step return (which adds the rewards of the following N steps and the discounted estimated value of the state that caused the N-th reward) is used to calculate the target yi.
6. Update the critic parameters by minimizing the loss L across all sampled experiences.

   $$L = \frac{1}{2M}\sum_{i=1}^{M}\big(y_i - Q(S_i,A_i;\phi)\big)^2$$

7. Update the actor parameters using the following sampled policy gradient to maximize the expected discounted cumulative long-term reward.

   $$\nabla_\theta J \approx \frac{1}{M}\sum_{i=1}^{M} G_{ai}\,G_{\pi i}, \qquad G_{ai} = \nabla_A Q(S_i,A;\phi)\Big|_{A=\pi(S_i;\theta)}, \qquad G_{\pi i} = \nabla_\theta\, \pi(S_i;\theta)$$

   Here, Gai is the gradient of the critic output with respect to the action computed by the actor network, and Gπi is the gradient of the actor output with respect to the actor parameters. Both gradients are evaluated for observation Si.
8. Update the target actor and critic parameters depending on the target update method. For more information, see Target Update Methods.
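To make the target, loss, and sampled policy gradient computations concrete, here is a small plain-MATLAB sketch on a toy problem. It is an illustration under stated assumptions, not toolbox code: the actor and critic are hypothetical linear functions (so their gradients can be written by hand), the mini-batch is random data, and the loss is assumed to be the halved mean squared difference between target and critic output.

```matlab
rng(0);
M = 4; obsDim = 2; gamma = 0.99;           % illustrative sizes and discount factor
S      = randn(obsDim,M);                  % sampled observations S_i
A      = randn(1,M);                       % sampled actions A_i
R      = randn(1,M);                       % sampled rewards R_i
Snext  = randn(obsDim,M);                  % sampled next observations S'_i
isDone = [false false true false];         % terminal flags for S'_i

% Hypothetical linear approximators (assumptions for this sketch).
criticFcn = @(S,A,phi) phi(1:end-1)'*S + phi(end)*A;   % Q(S,A;phi)
actorFcn  = @(S,theta) theta'*S;                       % pi(S;theta)
phi   = [0.5; -0.2; 0.3];  phiT   = phi;   % critic and target critic parameters
theta = [0.1; 0.4];        thetaT = theta; % actor and target actor parameters

% Value function target: the target actor picks the next action and the
% target critic evaluates it. Terminal experiences use the reward only.
nextA = actorFcn(Snext,thetaT);
y = R + gamma*(~isDone).*criticFcn(Snext,nextA,phiT);

% Critic loss across the mini-batch (assumed form: halved mean squared error).
tdErr = y - criticFcn(S,A,phi);
L = sum(tdErr.^2)/(2*M);

% Sampled policy gradient. For these linear functions:
%   Ga_i  = dQ/dA      = phi(end)
%   Gpi_i = dpi/dtheta = S_i
Ga    = phi(end)*ones(1,M);
Gpi   = S;
gradJ = (Gpi*Ga')/M;                       % obsDim-by-1 gradient estimate
```

With neural network approximators, the two gradient factors would come from automatic differentiation instead of the closed-form expressions used here.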
For simplicity, the actor and critic updates in this algorithm show a gradient update using basic stochastic gradient descent. The actual gradient update method depends on the optimizer you specify in the rlOptimizerOptions objects assigned to the ActorOptimizerOptions and CriticOptimizerOptions properties of the rlDDPGAgentOptions object.
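As a configuration sketch, optimizer settings might be supplied as follows. The learning rates and gradient threshold are illustrative values only, and the snippet assumes that the agent options object exposes the optimizer settings through properties named ActorOptimizerOptions and CriticOptimizerOptions.

```matlab
% Illustrative optimizer settings (values are assumptions, not recommendations).
criticOpts = rlOptimizerOptions(LearnRate=1e-3,GradientThreshold=1);
actorOpts  = rlOptimizerOptions(LearnRate=1e-4,GradientThreshold=1);

% Assign the optimizer options to the agent options using dot notation.
agentOpts = rlDDPGAgentOptions;
agentOpts.CriticOptimizerOptions = criticOpts;
agentOpts.ActorOptimizerOptions  = actorOpts;
```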
Target Update Methods
DDPG agents update their target actor and critic parameters using one of the following target update methods.
Smoothing — Update the target parameters at every time step using smoothing factor τ. To specify the smoothing factor, use the TargetSmoothFactor option.
Periodic — Update the target parameters periodically without smoothing (TargetSmoothFactor = 1). To specify the update period, use the TargetUpdateFrequency parameter.
Periodic Smoothing — Update the target parameters periodically with smoothing.
To configure the target update method, create an rlDDPGAgentOptions object, and set the TargetUpdateFrequency and TargetSmoothFactor parameters as shown in the following table.
Update Method | TargetUpdateFrequency | TargetSmoothFactor |
---|---|---|
Smoothing (default) | 1 | Less than 1 |
Periodic | Greater than 1 | 1 |
Periodic smoothing | Greater than 1 | Less than 1 |
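The three methods in the table can be compared with a short plain-MATLAB sketch. It assumes the common smoothing rule target = τ·latest + (1 − τ)·target, which is consistent with a smoothing factor of 1 meaning a full copy. The parameter vectors are hypothetical, and the latest parameters are held constant here only to keep the example easy to check.

```matlab
% Assumed smoothing rule: blend the latest parameters into the target.
smooth = @(target,latest,tau) tau*latest + (1-tau)*target;

latest = [1; 2; 3];                 % hypothetical latest critic (or actor) parameters
% Rows: [TargetUpdateFrequency, TargetSmoothFactor]
settings = [1 0.01;                 % smoothing
            4 1;                    % periodic
            4 0.01];                % periodic smoothing
numSteps = 8;
targets = zeros(numel(latest),size(settings,1));

for m = 1:size(settings,1)
    freq = settings(m,1);
    tau  = settings(m,2);
    target = zeros(size(latest));   % hypothetical initial target parameters
    for step = 1:numSteps
        % The actor and critic updates would happen here.
        if mod(step,freq) == 0      % update only every freq steps
            target = smooth(target,latest,tau);
        end
    end
    targets(:,m) = target;          % one column per update method
end
```

With smoothing, the target creeps toward the latest parameters at every step; with periodic updates it jumps to them every fourth step; with periodic smoothing it takes a small step toward them every fourth step.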
References
[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.” ArXiv:1509.02971 [Cs, Stat], September 9, 2015. https://arxiv.org/abs/1509.02971.