Video length is 28:06

Parallel Computing Hands-On Workshop

This video accompanies a hands-on workshop introducing you to parallel computing with MATLAB® and Simulink®, so that you can solve computationally and data-intensive problems using multicore processors, GPUs, and computer clusters. By working through common scenarios to parallelize MATLAB algorithms and run multiple Simulink simulations in parallel, you will gain an understanding of parallel computing with MATLAB and Simulink and learn about best practices.

Along with the video, exercises and examples are provided to reinforce how to use parallel computing with MATLAB and Simulink. Workshop exercises and examples will vary in difficulty from simple parallel usage concepts to more advanced techniques.

Highlights

• Speeding up MATLAB applications with parallel computing
• Running multiple Simulink simulations in parallel
• GPU computing
• Offloading computations and cluster computing
• Working with large data sets

Published: 5 Jul 2020

Hi, everyone, and welcome to this workshop on parallel computing with MATLAB and Simulink. Parallel computing is an important topic because the problems that engineers and researchers face are growing larger and becoming more complex. In addition, as technology evolves, expectations have increased for faster and more efficient results.

Parallel computing in MATLAB and Simulink enables engineers, scientists, and researchers in any sector or industry to leverage the compute resources readily available to them without needing to be experts in parallel computing. Here are some examples of real performance gains experienced by MATLAB customers who used MATLAB parallel computing tools to speed up their work. This workshop will guide you through the steps and tips for doing the same, and keep in mind that parallel computing hardware is becoming increasingly widespread and available.

Multicore processors are the norm, and GPU devices capable of computation are becoming commonplace. In addition, access to cluster and cloud environments has been increasing, providing the ability to use computational resources beyond what is available on a typical workstation. Whether you're planning to leverage multicore processors, compute clusters, or GPUs, it is important to optimize your code so that you can receive even better performance improvements when you introduce parallel tools for additional computing power.

Let's highlight some steps you can take to optimize your code. Before modifying your code, you need to determine where to focus your efforts. Perhaps the most critical tool to support this process is the profiler, which can help you find bottlenecks by telling you where your code is spending most of its execution time. Improving those areas gives you the biggest performance boost for your efforts.

Once you've located your areas of investment, you can use effective programming techniques like preallocation and vectorization to accelerate the execution of your MATLAB code. The MATLAB Code Analyzer can help advise you with this, in addition to bringing issues and errors in your code to your attention. Finally, you may also obtain speedups by replacing parts of your MATLAB code with an automatically generated MATLAB executable, known as a MEX function. You can do this using a separate product called MATLAB Coder.
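As a rough illustration of preallocation and vectorization, here is a minimal sketch that computes the same result three ways and times each one. The problem size and variable names are chosen only for this example, and the timings will differ from machine to machine.

```matlab
n = 1e5;

% Version 1: the array grows on every iteration (Code Analyzer flags this).
tic
grown = [];
for k = 1:n
    grown(k) = sqrt(k); %#ok<SAGROW>
end
tGrown = toc;

% Version 2: preallocate the result, then fill it in the loop.
tic
prealloc = zeros(1, n);
for k = 1:n
    prealloc(k) = sqrt(k);
end
tPrealloc = toc;

% Version 3: vectorized, with no explicit loop.
tic
vectorized = sqrt(1:n);
tVectorized = toc;

fprintf('grown: %.4f s, preallocated: %.4f s, vectorized: %.4f s\n', ...
    tGrown, tPrealloc, tVectorized);
```

To see where a larger program spends its time, you can run it between profile on and profile viewer and look at the lines the profiler reports as most expensive.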

It's important to note that even without Parallel Computing Toolbox, MATLAB provides implicit multicore support with its built-in multithreading. A growing number of core MATLAB functions take advantage of underlying multithreaded library support, and other toolboxes leverage those benefits when they use core MATLAB functions. However, not every MATLAB function can be multithreaded, and any acceleration is limited to your local workstation. Parallel computing tools enable you to receive benefits beyond these limitations.

Put another way, Parallel Computing Toolbox enables direct control of your parallel resources. For example, parallel constructs like parfor let you control which portions of your workflow are distributed to multiple cores. Later on, we'll talk about how you can extend that level of control to resources on compute clusters by using MATLAB Parallel Server.

The following video clip will clearly demonstrate performance improvements obtained using parallel computing in MATLAB, specifically with parfor. We have three different scenarios where we run the same parameter sweep code in three different computing environments, a single desktop workstation, a cluster of 200 cores, and a cluster of 1,000 cores. As you saw for this problem, using 1000 cores provided a very significant speedup.

That being said, we should mention that throwing more cores at a problem doesn't always give you proportionally faster results. As a general rule of thumb, if your model or application is computationally intensive and you have a large number of independent iterations to complete, you can likely make efficient use of a large number of cores to speed up your overall execution time. Having discussed the motivation for and usefulness of parallel computing, let's talk about how to utilize it in MATLAB.

We'll start by talking about using multiple cores on a desktop computer, where we will also learn some parallel computing fundamentals like the concept of a worker and how a parfor loop works. After that, we'll talk about using GPUs, followed by scaling up to a cluster or cloud environment. We'll then give some tips on using parallel computing with big data.

MathWorks offers two parallel computing tools. We alluded several times to Parallel Computing Toolbox, which we will cover now. Later, we'll talk about MATLAB Parallel Server. Those of you licensed to use Parallel Computing Toolbox will install it alongside MATLAB. We'll use the term MATLAB client to refer to the machine with the toolbox installed.

The toolbox will allow you to be more productive with your multicore processors by utilizing MATLAB computational engines called workers. These workers are controlled by your MATLAB session and allow you to use the full potential of your hardware to speed up your workflow. You can use workers interactively or send work for them to run in the background. Workers form the basis for CPU-based parallel workflows.

When you have a collection of workers with interprocess communication, we call that a parallel pool. You can initialize and manage a parallel pool programmatically using MATLAB code, or interactively from this icon in the MATLAB desktop environment. Parallel Computing Toolbox handles the work involved in dividing up tasks and computations and assigning them to workers in the parallel pool, thereby enabling your resources to perform parallel computing.

The behind-the-scenes work is all encapsulated in easy-to-use syntax, sometimes as simple as just changing one word, and you never have to leave the familiarity of the MATLAB desktop environment. In general, you should not run more MATLAB workers than the number of physical cores available to your machine; otherwise, you are likely to have resource contention.
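To make the programmatic route concrete, here is a minimal sketch of opening, inspecting, and closing a parallel pool. It assumes Parallel Computing Toolbox is installed and uses whatever default profile is configured on the machine.

```matlab
% Reuse the current pool if one exists; otherwise start one
% with the default cluster profile.
pool = gcp('nocreate');
if isempty(pool)
    pool = parpool;
end

nWorkers = pool.NumWorkers;
fprintf('Pool has %d workers.\n', nWorkers);

% Shut the pool down when the parallel work is finished.
delete(pool);
```

Passing a number to parpool, for example parpool(4), requests a specific number of workers, which is one way to stay at or below the number of physical cores.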

Now that we've covered the basics of enabling parallel computing through workers, let's talk about what we can do with them. Some parallel constructs in MATLAB are easier to get started with but offer less control. Others require more knowledge of parallel computing but offer more granular control. We'll start with the easiest to use and work our way down. A large number of MATLAB toolboxes have automatic parallel support built into them.

If you find that a function has parallel support and it's contained in your bottleneck, you can accelerate your code with very little effort. Here are examples of functions across different applications with automatic parallel support. The link at the bottom will give you the full list of toolboxes and functions that have automatic parallel support. Similarly, a number of parallel-enabled blocksets and toolboxes for Simulink can help you speed up your workflows with very little effort.

For example, Simulink Design Optimization has one of the best integrations with Parallel Computing Toolbox. You just enable a single checkbox, "Use parallel pool during optimization", and it will immediately speed up workflows like sensitivity analysis, response optimization, and parameter estimation. Once again, you can use the link at the bottom to see the full list of automatic parallel support.

Let's move on to the next level. If your bottleneck does not involve a function with automatic parallel support, there are plenty of parallel constructs available in MATLAB that give you more control over what is parallelized and how. The parallel computing team at MathWorks is actively adding more constructs and improving existing ones. As we mentioned earlier, depending on the problem, parallel computing does not always give you proportional improvements. However, there are ideal problems for parallel computing, where computationally intensive problems are just a matter of multiple tasks, iterations, or simulations that don't depend on each other to complete their calculations.

Real world examples of such problems are Monte Carlo simulations, parameter sweeps, and design optimization, and the easiest way to address this challenge is to use parallel for loops. For example, let's say you want to run five iterations of your code. If you run it in a for loop, they run serially, one after the other. You wait for one to complete before moving to the next iteration. However, if they're all independent tasks with no dependencies or communication needed between individual iterations, you can distribute these tasks to separate workers and compute them in parallel at the same time.

This maximizes the utilization of the cores on your machine and gets you the results sooner. Parallel for loops are implemented using the parfor command. While parfor requires Parallel Computing Toolbox to be able to leverage workers for parallel processing, it will actually still run without it. That means you can share code that uses parfor with colleagues and collaborators who might not have access to Parallel Computing Toolbox.

In that scenario, parfor will behave like a traditional for loop, albeit with a different order of iterations. In this example, we want to take a typical serial for loop and run it in parallel using our multicore processor. The iterations in this for loop are not dependent on one another and do not need to pass information between each other. All we have to do here is change the for loop to a parfor loop, and this will automatically run the iterations in parallel across multiple workers.
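The one-word change can be sketched as follows. The loop body here is a stand-in chosen for illustration, not the code shown in the video; what matters is that each iteration is independent and writes only to its own element of the output.

```matlab
n = 200;
maxEig = zeros(1, n);   % preallocated output, indexed by the loop variable

% Replacing "for" with "parfor" is the only change from the serial loop.
parfor k = 1:n
    % Each iteration builds its own matrix and touches only maxEig(k),
    % so no iteration depends on any other.
    A = magic(k);
    maxEig(k) = max(abs(eig(A)));
end
```

If a pool is open, or MATLAB is allowed to start one automatically, the iterations are divided among the workers. Without Parallel Computing Toolbox, the same code still runs on the client.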

Parfor will automatically distribute the tasks to the available workers and collect results upon completion. When changing for to parfor, you may need to make some adjustments to your code. The code analyzer will help to guide you through this process by informing you of what changes need to be made in order to run the parfor loop.

In this example, no warnings are displayed and no additional code changes are needed. In the second example with slightly different code, the code analyzer determines that there is an issue and brings it to our attention. The following illustration will give you more insight into what happens when parfor executes. In this example, MATLAB has access to three workers. They are assigned tasks to run, and once a worker has finished its current task, it can be assigned additional work.

Finally, the results are collected and can be displayed in MATLAB. When MATLAB recognizes a name in a parfor loop as a variable, the variable is classified in one of several categories shown in the table on the right. Two variable types that can have a significant impact on your runtime are sliced variables and broadcast variables. A sliced variable is a variable whose value can be broken up into segments or slices, which are then operated on separately by different workers.

Each iteration of the loop works on a different slice of the array. Using sliced variables can reduce the amount of required communication between the client and the workers. A broadcast variable is any variable, other than the loop variable or a sliced variable, that does not change inside the loop. At the start of a parfor loop, the values of any broadcast variables are sent to all workers. That means that large broadcast variables can incur significant overhead, because they have to be transferred between the client and the workers.
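A minimal sketch of the two categories just described, with array sizes and names made up for the example:

```matlab
nRows = 1000;
data     = rand(nRows, 50);    % sliced input: iteration k reads only data(k, :)
weights  = rand(50, 1);        % broadcast: every worker receives the whole vector
rowScore = zeros(nRows, 1);    % sliced output: iteration k writes only rowScore(k)

parfor k = 1:nRows
    rowScore(k) = data(k, :) * weights;
end
```

Because data is indexed by the loop variable, each worker only receives the rows it needs. The weights vector is sent in full to every worker, which is cheap here but would add noticeable overhead if it were very large.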

Therefore, optimize parfor loops by using sliced variables where you can and keeping any necessary broadcast variables small to reduce parallel overhead. Another common parallel construct is parfeval, short for parallel feval. This parallel construct is similar to parfor loops because it utilizes parallel workers to run multiple tasks in parallel.

The difference is it operates only on functions and it is asynchronous or non-blocking with respect to MATLAB. Unlike parfor loops, parfeval allows you to continue executing commands in MATLAB while the parallel work completes in the background. Parfeval creates a queue of tasks that each execute a function on a parallel worker.

The queue is such that the next item in the queue is always executed on the next available worker in the pool, thereby preserving order of execution. After tasks are queued up for execution, you are free to use MATLAB on other tasks without having to wait on the queued tasks. When they are done, you can retrieve results of the computation using fetchNext.

You can also add or delete tasks from the queue. Parfeval also distributes work to the parallel workers in a different manner than parfor. As you saw, instead of transferring groups of tasks to the parallel workers, parfeval transfers one task at a time. If your tasks or iterations have significantly different run times, parfeval will help avoid idle workers that can be caused by grouping tasks.
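Here is a minimal sketch of queuing work with parfeval and collecting results with fetchNext. It assumes Parallel Computing Toolbox is available, and the function being evaluated is just a placeholder computation.

```matlab
pool = gcp;                 % current pool, started if none is open
nTasks = 8;

% Queue the tasks. Each parfeval call returns immediately with a future.
futures(1:nTasks) = parallel.FevalFuture;
for k = 1:nTasks
    futures(k) = parfeval(pool, @(n) sum(svd(magic(n))), 1, 50 + k);
end

% The client is free to do other work here while the workers compute.

% Collect results as they become available. idx identifies which
% future each value belongs to.
results = zeros(1, nTasks);
for k = 1:nTasks
    [idx, value] = fetchNext(futures);
    results(idx) = value;
end
```

Calling cancel on a future removes it from the queue if it has not finished yet.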

That being said, parfor will still most likely be your go-to solution, but keep parfeval in mind if you need to preserve the order of execution, if you need a parallel queue, or if you would like to keep using MATLAB while it performs computations in the background. The data queue allows you to pass data from the parallel workers back to the MATLAB client. One useful application of this is being able to view the progress of your parallel computation.

To get started, we construct the data queue and create a wait bar, which we will use to view our progress. We then specify what action will take place every time a worker triggers the data queue. In this case, we want to run a function, nUpdateWaitbar, which will update the wait bar. We use the afterEach construct to trigger the nUpdateWaitbar function after each iteration in the parfor, as indicated by the send construct.

You can also use afterEach with parfeval. As the parfor runs, the workers notify the client after their calculation is finished. This triggers the nUpdateWaitbar function, which updates the indicated progress of the parallel workers.
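The progress-reporting workflow can be sketched as a single function file. This is a minimal version under a few assumptions: the pause stands in for real work, the function and variable names are chosen for illustration, and a display is available for the wait bar.

```matlab
function parforWaitbarDemo
    n = 100;
    q = parallel.pool.DataQueue;        % channel from the workers to the client
    h = waitbar(0, 'Running parfor ...');
    done = 0;

    % Run nUpdateWaitbar on the client each time a worker sends data.
    afterEach(q, @nUpdateWaitbar);

    parfor k = 1:n
        pause(0.05);                    % stand-in for the real computation
        send(q, k);                     % tell the client this iteration is done
    end
    close(h);

    function nUpdateWaitbar(~)
        % Nested function: shares done, n, and h with the parent function.
        done = done + 1;
        waitbar(done / n, h);
    end
end
```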

The key components of this workflow are DataQueue, afterEach, send, and the function you provide to run after each iteration. While you can technically use parfor with Simulink, parfor was designed primarily for MATLAB and is not recommended for use with Simulink. Instead, use parsim to run multiple simulations in parallel.

Parsim distributes multiple simulations to multicore CPUs to speed up overall simulation time. It automates the creation of parallel pools, identifies file dependencies, and manages build artifacts. It also works in conjunction with a new simulation input object, which helps you set up all your simulation inputs in a convenient way, including variables, block parameters, and simulation configurations.
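As a rough sketch of how parsim and the simulation input object fit together, the following sweeps one variable across several runs. The model name my_model and the variable K are hypothetical placeholders, so substitute a model and a variable from your own project.

```matlab
mdl = 'my_model';            % hypothetical model name
gains = [0.5 1 2 4];         % values to sweep

% One SimulationInput object per run, each overriding the variable K.
in(1:numel(gains)) = Simulink.SimulationInput(mdl);
for k = 1:numel(gains)
    % 'K' is a hypothetical variable that the model is assumed to use.
    in(k) = setVariable(in(k), 'K', gains(k));
end

% Run all simulations in parallel. The result is an array with one
% Simulink.SimulationOutput object per run.
out = parsim(in, 'ShowProgress', 'on');
```

Block parameters can be overridden in the same way with setBlockParameter.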

At this point, we've talked about automatic parallel support and common programming constructs. Parallel Computing Toolbox also offers advanced parallel constructs for the most control of your resources, such as configuring parallel workers to communicate with one another and pass data, splitting large matrices across the memory of multiple machines, and working with large repositories of data. Since these topics may not have broad appeal, if you'd like to learn more, you can check the resources at the end of this presentation.

Product documentation and technical support can also help you with your questions. Parallel Computing Toolbox enables you to use NVIDIA GPUs to accelerate AI, deep learning, and other computationally intensive analytics without having to be a CUDA programmer. MATLAB has hundreds of functions with support for using NVIDIA GPUs, and you'll be able to access multiple GPUs on a desktop or cluster, generate CUDA code, and more.

To use a GPU on your workstation, you simply need MATLAB and Parallel Computing Toolbox. You also need to make sure you have a supported NVIDIA GPU device with a recent graphics driver. It's best practice to ensure you have the latest driver for your device. GPUs have hundreds, sometimes thousands, of cores with a very focused instruction set.

A MATLAB desktop session or a single worker is all that is needed to take advantage of an entire GPU. In Deep Learning Toolbox, functions like trainNetwork can use a GPU if you set a flag and have a suitable GPU. In addition, hundreds of functions in MATLAB and other toolboxes are overloaded to use a GPU if you supply a gpuArray argument, which we will cover shortly.

The link at the bottom takes you to the documentation for running MATLAB functions on a GPU, where you can also find a list of GPU supported functions. Not all problems are suited for the GPU. Ideal problems for GPU computing are massively parallel and/or computationally intensive. Massively parallel means that the computations can be broken down into hundreds or thousands of independent units of work.

You will see the best performance when all of the cores are kept busy, exploiting the inherent parallel nature of the GPU. Computationally intensive means that the time spent on computation significantly exceeds the amount of time spent on transferring data to and from GPU memory. GPUs have a high-speed memory bus for data transfer within the GPU. However, the GPU has to use the much slower PCI Express bus to communicate with the CPU.

This means that your overall computational speedup will be reduced by the amount of time it takes to transfer data between devices, as required for your algorithm. MATLAB developers have written CUDA versions of key MATLAB and toolbox functions, which are presented as overloaded functions. The GPU version of a function will run when the input is in GPU memory. In the example to the right, we initially create a matrix on the CPU.

Using gpuArray, we send a copy of that matrix to the GPU. We then execute an fft function on that matrix, and note that even though there is no explicit instruction to use a GPU, MATLAB will see that the matrix resides on the GPU and will use the GPU instead of the CPU to perform the computation. This means that you can use your GPU for faster computation while still using the same underlying code. After the computation is complete, you can gather the results and view them in MATLAB as normal. You can further accelerate your code using advanced GPU CUDA and MEX programming.
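The create, transfer, compute, and gather steps can be sketched as follows. This assumes Parallel Computing Toolbox and a supported NVIDIA GPU are available. The matrix size is arbitrary, and the CPU comparison at the end is added only as a sanity check.

```matlab
A = rand(4096, 'single');     % matrix created in CPU memory
G = gpuArray(A);              % copy it to GPU memory
F = fft(G);                   % same fft call; runs on the GPU because G is a gpuArray
result = gather(F);           % copy the result back to CPU memory

% Compare against the same computation done on the CPU.
cpuResult = fft(A);
relErr = norm(result - cpuResult, 'fro') / norm(cpuResult, 'fro');
```

Keeping intermediate results on the GPU and calling gather only at the end avoids repeated transfers over the slower bus between the CPU and the GPU.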

You can learn more by consulting product documentation and technical support. Now that we have built up a fundamental understanding of parallel computing in MATLAB, let's discuss how we can start migrating our workflow to a cluster or cloud for even more computational power. After all, the problems or challenges you work on might need additional computational resources or memory beyond what is available on a single multicore desktop machine.

MATLAB Parallel Server enables you to scale up your desktop workflow to access the additional computational power and memory of multiple computers in a cluster, whether on premises in your organization or on the cloud. While setting up the infrastructure might require support from your IT staff, you can send jobs to a cluster without ever leaving the MATLAB desktop environment. And in the same spirit as other parallel computing tools in MATLAB, the code you developed on your desktop machine can be run on the cluster without having to recode your underlying algorithms. You can use a cluster for additional computing power or simply to free up your desktop computer for other work, which we'll soon discuss with a batch workflow.

To scale to a cluster, you'll need MATLAB and Parallel Computing Toolbox on your MATLAB client workstation, along with the licenses for any other toolbox required by your code. On the cluster side, you only need MATLAB Parallel Server. Instead of checking out toolbox licenses, each MATLAB Parallel Server worker dynamically licenses toolboxes and blocksets to match the licenses from the submitting client.

You can select where to run your code by defining it programmatically in your code or through the MATLAB UI using cluster profiles. By default, you will have the local profile, which will run workers on your MATLAB client. You can create or import other profiles, which will point you to workers on remote hardware.

You can have multiple profiles for access to different cluster environments, and you could submit jobs to different profiles from the same MATLAB session. You can even have an interactive parallel pool on a cluster, which is useful for debugging and prototyping. Note that your MATLAB client can only interact with one parallel pool at a time.

Parallel computing in MATLAB supports cross-platform submissions, which means that the operating system on which your MATLAB client runs can be different from the cluster operating system. Since MATLAB syntax is the same on all platforms, there is no need to rewrite your algorithms. Of course, you'll need to ensure that your code does not use hardcoded, operating-system-specific file references. Parallel Computing Toolbox includes features like additional paths and attached files that help to resolve potential issues of sharing code and data with workers on a cluster.

Depending on your system and network configuration, you can use workers on the cluster interactively with parpool, which is useful for prototyping and debugging. For long-running jobs, you will want to transition to batch workflows, which we previously mentioned in passing. You can send your code using the batch command to run on remote hardware where MATLAB Parallel Server has been installed and configured.

Using batch offloads work from your computer so that your machine is no longer tied to the computation. That means you can do something else, put your machine to sleep, or even turn it off. By default, the command will request a single worker for serial computation, but you can include the Pool argument to request multiple workers for parallel computation.

You can submit multiple batch jobs, and the scheduler on the cluster side will schedule work as resources become available. You can check the state and progress of your job with the state and diary commands, which you can use programmatically through the MATLAB command window or interactively through the Job Monitor. After the job is finished, you can retrieve the results in MATLAB or view artifacts generated from the job on the file system of the cluster.
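A minimal sketch of the batch workflow described above, assuming Parallel Computing Toolbox and a working default cluster profile. The built-in magic function is used only so that the example is self-contained, and mySweep in the commented line is a hypothetical user function.

```matlab
c = parcluster;               % default profile; pass a profile name to pick a cluster

% Submit a function with one output and one input argument.
% By default this uses a single worker.
job = batch(c, @magic, 1, {4});

% For a function that contains parfor, also request pool workers, e.g.:
% job = batch(c, @mySweep, 1, {100}, 'Pool', 3);   % mySweep is hypothetical

disp(job.State)               % for example 'queued', 'running', or 'finished'
wait(job);                    % block until the job finishes (optional)
diary(job)                    % show any command-window output from the job

outputs = fetchOutputs(job);  % cell array of the function's outputs
result = outputs{1};
delete(job);                  % remove the job's data from cluster storage
```

Between submitting the job and calling wait, the client is free for other work, and you can close MATLAB and find the job again later from the same cluster object.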

You can also leverage batch processing with Simulink. Batchsim works similarly to the batch command by offloading your simulations to MATLAB Parallel Server on remote hardware, freeing up your desktop resources. Note that if you're already using parsim in your code, you can change parsim to batchsim, specify the Pool argument, and run the simulations in batch. After the simulations complete, you can retrieve the simulation results at your convenience for further processing on your desktop.

With a more complete understanding of parallel tools in MATLAB, we can see how MATLAB provides a single, high-performance environment for working with big data on your desktop or on a cluster. There are MATLAB capabilities customized for both beginners and power users of big data applications.

You'll be able to use constructs like datastores and tall arrays to access data that does not fit in memory, use data from Hadoop Distributed File System, or HDFS, access cloud-based storage, and create repositories of large amounts of images, spreadsheets, and custom files. While you'll have to learn about a few more functions, they all use the same intuitive MATLAB syntax with which you're already familiar. You can prototype algorithms quickly using small data sets and then scale up using the same code to big data sets stored in and processed on large clusters.

Tall arrays provide a way to visualize, parse, and analyze data, even if it comprises millions or billions of rows too big to fit in your machine's random access memory, or RAM. They are backed by a datastore, which is a repository of a large number of files located on a machine, the cloud, or a cluster. Many operations and functions are overloaded to work with tall arrays, using the same syntax as you would use with normal in-memory MATLAB arrays. Tall arrays wrap around one of these datastores and treat the entire set of data as one continuous table or array.

The underlying datastore enables calculations to work through the array one piece at a time. All this is done behind the scenes, and as the user, you simply write what looks like normal MATLAB code. For example, if we build a tall array from multiple CSV files containing tabular data, the resulting tall array is a tall table. Even though this table doesn't fit in memory, we can use standard table functions like summary or dot references to access the columns and then use max, min, plus, minus, just as we would for a regular, in-memory table.
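Here is a minimal sketch of building a tall table from several CSV files. The two tiny files are written to a temporary folder only so that the example is self-contained; in practice the datastore would point at data too large to load at once. The column names are made up for the example.

```matlab
% Write two small CSV files to a temporary folder as stand-in data.
folder = tempname;
mkdir(folder);
writetable(table((1:5)', [10 20 30 40 50]', 'VariableNames', {'Id', 'Value'}), ...
    fullfile(folder, 'part1.csv'));
writetable(table((6:10)', [5 15 25 35 45]', 'VariableNames', {'Id', 'Value'}), ...
    fullfile(folder, 'part2.csv'));

ds = datastore(fullfile(folder, '*.csv'));   % datastore over all the files
tt = tall(ds);                               % tall table backed by the datastore

% These look like ordinary table operations, but evaluation is
% deferred until gather is called.
maxValue = max(tt.Value);
shifted  = tt.Value - min(tt.Value);
[maxValue, shifted] = gather(maxValue, shifted);
```

Requesting both results in a single gather call lets MATLAB combine the work into fewer passes through the files.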

Since the pieces are processed independently, you can parallelize this using Parallel Computing Toolbox and process several pieces at a time. Naturally, you can scale up tall arrays across multiple computers using MATLAB Parallel Server. Distributed arrays are a parallel data type that uses the memory of multiple machines to store variables that are too large to store on a single machine.

Using distributed arrays, you can distribute your matrices across multiple machines and go beyond the capabilities of a single computer. You can prototype distributed array workflows on the desktop using Parallel Computing Toolbox, and then scale up to a cluster, which uses MATLAB Parallel Server. In the same spirit of trying to make things easy and simple, a large number of standard MATLAB functions work with distributed arrays using the same syntax as normal arrays, again, as overloaded functions.

That means you can program a distributed array algorithm in the same manner as an in-memory algorithm, and MATLAB will run the right version of the code based on the input data type. This enables you to take advantage of distributed computing without needing to be an expert in message passing. In this example, we develop and prototype an algorithm using distributed arrays on a local machine with a small data sample.

Once we are confident that our algorithm works, we need only change our cluster profile to run the same algorithm on the entire data set, scaled up on the cluster. Using a datastore, we can access multiple files, each of which contains a portion of a matrix that we will use in our calculation. We use distributed with a datastore to allow the data to be spread across the pool of workers in a way that allows the matrix to be processed as a single entity.
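As a rough sketch of a distributed array workflow, the following solves a linear system whose matrix is spread across the workers of a pool. To stay self-contained it generates a random matrix rather than reading one from a datastore as in the slide. It assumes Parallel Computing Toolbox is available, and the profile name in the comment is a hypothetical placeholder.

```matlab
% Prototype on the local machine. To scale up, open the pool with a
% cluster profile instead, e.g. parpool('myCluster') (hypothetical name).
if isempty(gcp('nocreate'))
    parpool;
end

n = 1000;
A = distributed.rand(n);      % n-by-n matrix partitioned across the workers
b = sum(A, 2);                % right-hand side chosen so the solution is all ones

x = A \ b;                    % overloaded backslash operates on the distributed data
err = gather(max(abs(x - 1)));   % bring only the scalar error back to the client
```

Apart from how the pool is opened and how A is created, the code is what you would write for an ordinary in-memory matrix.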

We can test locally with a small matrix stored in a single file or small set of files, and then easily scale up by changing the profile to access a cluster and using a datastore that accesses the entire data set, which will comprise a much larger matrix. Note that outside of the circled code, the rest of the code is exactly the same. As we wrap up, we'll also leave you with some resources for additional information. Hopefully, you've been able to see repeatedly the theme of MATLAB making it easy for you to use powerful computational resources and techniques so you can focus on your algorithms and research.

You don't need to be a parallel programming expert to get started, and you can always dig deeper into more advanced techniques if you want to get even more performance out of your resources. Those resources include the hardware already available to your machine, as well as additional computational power from GPUs or a cluster of machines. And whether you're developing serial or parallel algorithms, you can develop and prototype them locally using the familiar MATLAB syntax, then scale up to clusters or to the cloud without having to rewrite your underlying code.

Here are some resources regarding the topics mentioned in this presentation. Please feel free to reach out to us with any questions regarding these topics and more. Technical support or your account manager will also be glad to help answer any questions you may have.