Train Agents Using Parallel Computing and GPUs
If you have Parallel Computing Toolbox™ software, you can run parallel simulations on multicore processors or GPUs. If you additionally have MATLAB® Parallel Server™ software, you can run parallel simulations on computer clusters or cloud resources.
Note that parallel training and simulation are not supported for agents that use recurrent neural networks or for agents within multi-agent environments.
Regardless of which devices you use to simulate or train the agent, once the agent has been trained you can generate code to deploy the optimal policy on a CPU or GPU. This is explained in more detail in Deploy Trained Reinforcement Learning Policies.
Using Multiple Processes
When you train agents using parallel computing, the parallel pool client (the MATLAB process that starts the training) sends copies of both its agent and environment to each parallel worker. Each worker simulates the agent within the environment and sends its simulation data back to the client. The client agent learns from the data sent by the workers and sends the updated policy parameters back to the workers.
To create a parallel pool of N workers, use the following syntax:

pool = parpool(N);
If you do not create a parallel pool using parpool (Parallel Computing Toolbox), the train function automatically creates one using your default parallel pool preferences. For more information on specifying these preferences, see Specify Your Parallel Preferences (Parallel Computing Toolbox). Note that using a parallel pool of thread workers, pool = parpool("threads"), is not supported.
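As a sketch, the commands below show one way to start a process-based pool before training, reusing an existing pool if one is already open (gcp and parpool are Parallel Computing Toolbox functions; the worker count 4 is an arbitrary example):

```matlab
% Start (or reuse) a process-based parallel pool with 4 workers.
% Thread-based pools, parpool("threads"), are not supported for training.
pool = gcp("nocreate");      % get the current pool without creating one
if isempty(pool)
    pool = parpool(4);       % process-based pool with 4 workers
end
```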
To train an agent using multiple processes, you must pass to the train function an rlTrainingOptions object in which the UseParallel property is set to true.
For more information on configuring your training to use parallel computing, see the UseParallel and ParallelizationOptions options in rlTrainingOptions. For an example on how to configure options for asynchronous advantage actor-critic (A3C) agent training, see the last example in rlTrainingOptions.
For an example that trains an agent using parallel computing in MATLAB, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing. For examples that train agents using parallel computing in Simulink®, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing and Train Biped Robot to Walk Using Reinforcement Learning Agents.
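As a minimal sketch of this workflow, assuming you already have an agent and an environment (here named agent and env), the training options might be configured as follows:

```matlab
% Assumes agent and env already exist (for example, from rlDQNAgent
% and rlPredefinedEnv). The episode limits below are arbitrary examples.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=1000, ...
    MaxStepsPerEpisode=500, ...
    UseParallel=true);            % distribute simulations to the pool workers
trainingStats = train(agent, env, trainOpts);
```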
Agent-Specific Parallel Training Considerations
Reinforcement learning agents can be trained in parallel in two main ways: experience-based parallelization, in which the workers compute only experiences, and gradient-based parallelization, in which the workers also compute the gradients that allow the agent approximators to learn.
Experience-Based Parallelization (DQN, DDPG, TD3, SAC, PPO, TRPO)
When training a DQN, DDPG, TD3, SAC, PPO, or TRPO agent in parallel, the environment simulation is done by the workers and the gradient computation is done by the client. Specifically, the workers simulate (their copy of) the agent within (their copy of) the environment, and send experience data (observation, action, reward, next observation, and a termination signal) to the client. The client then computes the gradients from the experiences, updates the agent parameters, and sends the updated parameters back to the workers, which then continue to perform simulations using their copy of the updated agent.
This type of parallel training is also known as experience-based parallelization, and can run using asynchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function can be set to "async").
Experience-based parallelization can reduce training time only when the computational cost of simulating the environment is high compared to the cost of optimizing network parameters. Otherwise, when the environment simulation is fast enough, the workers lie idle waiting for the client to learn and send back the updated parameters.
In other words, experience-based parallelization can improve sample efficiency (that is, the number of samples an agent can process within a given time) only when the ratio R between the environment step complexity and the learning complexity is large. If both environment simulation and gradient computation (that is, learning) are similarly computationally expensive, experience-based parallelization is unlikely to improve sample efficiency. In this case, for off-policy agents that are supported in parallel (DQN, DDPG, TD3, and SAC), you can reduce the mini-batch size to make R larger, thereby improving sample efficiency.
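For example, for an off-policy agent such as DQN you might reduce the mini-batch size through the agent options. The following sketch assumes a DQN agent; MiniBatchSize is the relevant option name:

```matlab
% Smaller mini-batches reduce the per-step learning cost on the client,
% increasing the ratio R of simulation cost to learning cost.
agentOpts = rlDQNAgentOptions(MiniBatchSize=32);   % reduced from a larger default
```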
For experience-based parallelization, do not use all of your processor cores for parallel training. For example, if your CPU has six cores, train with four workers. Doing so provides more resources for the parallel pool client to compute gradients based on the experiences sent back from the workers.
For an example of experience-based parallel training, see Train DQN Agent for Lane Keeping Assist Using Parallel Computing.
Gradient-Based Parallelization (AC and PG)
When training an AC or PG agent in parallel, both the environment simulation and gradient computations are done by the workers. Specifically, workers simulate (their copy of) the agent within (their copy of) the environment, obtain the experiences, compute the gradients from the experiences, and send the gradients to the client. The client averages the gradients, updates the agent parameters and sends the updated parameters back to the workers so they can continue to perform simulations using an updated copy of the agent.
This type of parallel training is also known as gradient-based parallelization, and in principle allows you to achieve a speed improvement that is nearly linear in the number of workers. However, this option requires synchronous training (that is, the Mode property of the rlTrainingOptions object that you pass to the train function must be set to "sync"). This means that workers must pause execution until all workers are finished, so training advances only as fast as the slowest worker allows.
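In terms of training options, a gradient-based (synchronous) configuration might look like the following sketch; ParallelizationOptions and its Mode property belong to rlTrainingOptions:

```matlab
% Synchronous parallel training, as required for AC and PG agents.
trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "sync";
```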
In general, limiting the number of workers in order to leave some processor cores for the client is not necessary when using gradient-based parallelization, because the gradients are not computed on the client. Therefore, for gradient-based parallelization, it might be beneficial to use all your processor cores for parallel training.
For an example of gradient-based parallel training, see Train AC Agent to Balance Cart-Pole System Using Parallel Computing.
Using GPUs
You can speed up training by performing actor and critic operations (such as gradient computation and prediction) on a local GPU rather than a CPU. To do so, when creating a critic or actor, set its UseDevice option to "gpu". The "gpu" option requires both Parallel Computing Toolbox software and a CUDA® enabled NVIDIA® GPU. For more information on supported GPUs, see GPU Computing Requirements (Parallel Computing Toolbox).
You can use gpuDevice (Parallel Computing Toolbox) to query or select a local GPU device to be used with MATLAB.
Using GPUs is likely to be beneficial when the actor or critic uses a deep neural network with large batch sizes, or one that performs operations such as multiple convolutional layers on input images.
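As a sketch, assuming you already have a deep network criticNet and specification objects obsInfo and actInfo, a GPU-backed critic might be created as follows:

```matlab
% Assumes criticNet, obsInfo, and actInfo already exist.
critic = rlQValueFunction(criticNet, obsInfo, actInfo);
critic.UseDevice = "gpu";   % perform gradient computation and prediction on the GPU
```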
For an example on how to train an agent using the GPU, see Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation.
Using both Multiple Processes and GPUs
You can also train agents using both multiple processes and a local GPU (previously selected using gpuDevice (Parallel Computing Toolbox)) at the same time. To do so, first create a critic or actor approximator object in which the UseDevice option is set to "gpu". You can then use the critic and actor to create an agent, and train the agent using multiple processes by creating an rlTrainingOptions object in which UseParallel is set to true and passing it to the train function.
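Combining the two, a sketch of this workflow (assuming existing actor and critic approximator objects and an environment env; rlDDPGAgent is just one possible agent constructor) might look like this:

```matlab
% Select the local GPU (optional; index 1 is an arbitrary example).
gpuDevice(1);

% Make the approximators compute on the GPU.
actor.UseDevice  = "gpu";
critic.UseDevice = "gpu";

% Build the agent, then train it using multiple processes.
agent = rlDDPGAgent(actor, critic);
trainOpts = rlTrainingOptions(UseParallel=true);
train(agent, env, trainOpts);
```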
For gradient-based parallelization (which must run in synchronous mode), the environment simulation is done by the workers, which also use their local GPU to calculate the gradients and perform a prediction step. The gradients are then sent back to the parallel pool client process, which averages them, updates the network parameters, and sends them back to the workers so they can continue to simulate the agent, with the new parameters, against the environment.
For experience-based parallelization (which can run in asynchronous mode), the workers simulate the agent against the environment and send experience data back to the parallel pool client. The client then uses its local GPU to compute the gradients from the experiences, updates the network parameters, and sends the updated parameters back to the workers, which continue to simulate the agent, with the new parameters, against the environment.
Note that when using both parallel processing and GPU to train PPO agents, the workers use their local GPU to compute the advantages, and then send processed experience trajectories (which include advantages, targets and action probabilities) back to the client.
- gpuDevice (Parallel Computing Toolbox)
- Train AC Agent to Balance Cart-Pole System Using Parallel Computing
- Train DQN Agent for Lane Keeping Assist Using Parallel Computing
- Train Biped Robot to Walk Using Reinforcement Learning Agents
- Train DDPG Agent to Swing Up and Balance Pendulum with Image Observation
- Train Reinforcement Learning Agents