This example shows how to use the `dspunfold` function to generate a multithreaded MEX file from a MATLAB® function using unfolding technology. The MATLAB function can contain an algorithm which is stateless (has no states) or stateful (has states).

NOTE: The following example assumes that the current host computer has at least two physical CPU cores. The presented screenshots, speedup, and latency values are collected using a host computer with six physical CPU cores.

Required MathWorks™ products:

• DSP System Toolbox™

• MATLAB Coder™

### Using dspunfold with a MATLAB Function Containing a Stateless Algorithm

Consider the MATLAB function `dspunfoldDCTExample`. This function computes the DCT of an input signal and returns the value and index of the maximum energy point.

`type dspunfoldDCTExample.m`
```
function [peakValue,peakIndex] = dspunfoldDCTExample(x)
% Stateless MATLAB function computing the dct of a signal (e.g. audio), and
% returns the value and index of the highest energy point

% Copyright 2015 The MathWorks, Inc.

X = dct(x);
[peakValue,peakIndex] = max(abs(X));

end
```

To accelerate the algorithm, a common approach is to generate a MEX file using the `codegen` function. This example shows how to do so when using an input of 4096 doubles. The generated MEX file, `dspunfoldDCTExample_mex`, is single-threaded.

`codegen dspunfoldDCTExample -args {(1:4096)'}`
```
Code generation successful.
```

To generate a multithreaded MEX file, use the `dspunfold` function. The argument `-s 0` indicates that the algorithm in `dspunfoldDCTExample` is stateless.

`dspunfold dspunfoldDCTExample -args {(1:4096)'} -s 0`
```
State length: 0 frames, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldDCTExample.m
Creating single-threaded MEX file: dspunfoldDCTExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldDCTExample_mt.mexa64
Creating analyzer file: dspunfoldDCTExample_analyzer.p
```

This command generates these files:

• Multithreaded MEX file `dspunfoldDCTExample_mt`

• Single-threaded MEX file `dspunfoldDCTExample_st`, which is identical to the MEX file obtained using the `codegen` function

• Self-diagnostic analyzer function `dspunfoldDCTExample_analyzer`

Three additional MATLAB files are also generated, containing the help for each of the above files.

To measure the speedup of the multithreaded MEX file relative to the single-threaded MEX file, see the example function `dspunfoldBenchmarkDCTExample`.

`type dspunfoldBenchmarkDCTExample`
```
function dspunfoldBenchmarkDCTExample
% Function used to measure the speedup of the multi-threaded MEX file
% dspunfoldDCTExample_mt obtained using dspunfold vs the single-threaded MEX
% file dspunfoldDCTExample_st.

% Copyright 2015 The MathWorks, Inc.

clear dspunfoldDCTExample_mt;  % for benchmark precision purpose
numFrames = 1e5;
inputFrame = (1:4096)';

% exclude first run from timing measurements
dspunfoldDCTExample_st(inputFrame);
tic;  % measure execution time for the single-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_st(inputFrame);
end
timeSingleThreaded = toc;

% exclude first run from timing measurements
dspunfoldDCTExample_mt(inputFrame);
tic;  % measure execution time for the multi-threaded MEX
for frame = 1:numFrames
    dspunfoldDCTExample_mt(inputFrame);
end
timeMultiThreaded = toc;
fprintf('Speedup = %.1fx\n',timeSingleThreaded/timeMultiThreaded);
```

`dspunfoldBenchmarkDCTExample` measures the execution time taken by `dspunfoldDCTExample_st` and `dspunfoldDCTExample_mt` to process `numFrames` frames. Finally, it prints the speedup, which is the ratio of the single-threaded MEX file execution time to the multithreaded MEX file execution time.

Run the example.

`dspunfoldBenchmarkDCTExample;`
```
Speedup = 2.9x
```

To improve the speedup even more, increase the repetition value. To modify the repetition value, use the `-r` flag. For more information on the repetition value, see the `dspunfold` function reference page. For an example on how to specify the repetition value, see the section 'Using dspunfold with a MATLAB Function Containing a Stateful Algorithm'.

`dspunfold` generates a multithreaded MEX file, which buffers multiple signal frames and then processes these frames simultaneously, using multiple cores. This process introduces some deterministic output latency. Executing `help dspunfoldDCTExample_mt` displays more information about the multithreaded MEX file, including the value of the output latency. For this example, the output of the multithreaded MEX file has a latency of 12 frames relative to its input (the value reported by `dspunfold` above), which is not the case for the single-threaded MEX file.
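As a rough illustration of what a fixed output latency means for downstream code, the following sketch models the delay with a plain FIFO and aligns the delayed outputs with the reference outputs before comparing them. It does not call the generated MEX files: `refOut` is a stand-in for the per-frame outputs of a single-threaded function, the FIFO is a hypothetical model of the buffering, and the `NaN` placeholders make no claim about what the multithreaded MEX actually returns during its first calls. The latency of 12 frames is the value reported by `dspunfold` above.

```matlab
% Hypothetical model of a fixed output latency (not the dspunfold internals).
latency   = 12;                 % frames, as reported by dspunfold above
numFrames = 40;                 % number of simulated calls (assumption)
refOut    = (1:numFrames)';     % stand-in: one scalar output per frame

fifo       = nan(latency,1);    % placeholder content before real data arrives
delayedOut = zeros(numFrames,1);
for k = 1:numFrames
    delayedOut(k) = fifo(1);            % oldest buffered value leaves first
    fifo = [fifo(2:end); refOut(k)];    % newest value enters the buffer
end

% Discard the first 'latency' outputs of the delayed stream and the last
% 'latency' outputs of the reference stream, then compare.
aligned  = delayedOut(latency+1:end);
expected = refOut(1:end-latency);
outputsMatch = isequal(aligned,expected);
```

When comparing the real `_mt` and `_st` outputs, the same index shift applies: output `k` of the single-threaded MEX corresponds to output `k + latency` of the multithreaded MEX.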

Run the `dspunfoldShowLatencyDCTExample` example. The generated plot displays the outputs of the single-threaded and multithreaded MEX files. Notice that the output of the multithreaded MEX is delayed by 12 frames relative to that of the single-threaded MEX.

`dspunfoldShowLatencyDCTExample;`

### Using dspunfold with a MATLAB Function Containing a Stateful Algorithm

The MATLAB function `dspunfoldFIRExample` executes two FIR filters.

`type dspunfoldFIRExample.m`
```
function y = dspunfoldFIRExample(u,c1,c2)
% Stateful MATLAB function executing two FIR filters

% Copyright 2015 The MathWorks, Inc.

persistent FIRSTFIR SECONDFIR
if isempty(FIRSTFIR)
    FIRSTFIR = dsp.FIRFilter('NumeratorSource','Input port');
    SECONDFIR = dsp.FIRFilter('NumeratorSource','Input port');
end
t = step(FIRSTFIR,u,c1);
y = step(SECONDFIR,t,c2);
```

To build the multithreaded MEX file, you must provide the state length corresponding to the two FIR filters. Specify `-s 1` to indicate that the state length does not exceed 1 frame.

```
firCoeffs1 = fir1(192,0.8);
firCoeffs2 = fir1(256,0.2,'High');
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} -s 1
```

```
State length: 1 frames, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
```

Executing this code generates:

• Multithreaded MEX file `dspunfoldFIRExample_mt`

• Single-threaded MEX file `dspunfoldFIRExample_st`

• Self-diagnostic analyzer function `dspunfoldFIRExample_analyzer`

• The corresponding MATLAB help files for these three files

The output latency of the multithreaded MEX file is 12 frames. To measure the speedup, execute `dspunfoldBenchmarkFIRExample`.

`dspunfoldBenchmarkFIRExample;`
```
Speedup = 1.4x
```

To improve the speedup of the multithreaded MEX file even more, specify the exact state length in samples. To do so, you must specify which input arguments to `dspunfoldFIRExample` are frames. In this example, the first input is a frame because the elements of this input are sequenced in time. Therefore, it can be further divided into subframes. The last two inputs are not frames because the FIR filter coefficients cannot be subdivided without changing the nature of the algorithm. The state length of the `dspunfoldFIRExample` MATLAB function is the sum of the state lengths of the two FIR filters (192 + 256 = 448 samples). Using the `-f` argument, mark the first input argument as true (frame) and the last two input arguments as false (nonframes).

`dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} -s 448 -f [true,false,false]`
```
State length: 448 samples, Repetition: 1, Output latency: 12 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
```
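The value 448 can be cross-checked with the built-in `filter` function, whose second output is the final state of the filter. The sketch below is only a sanity check under stated assumptions: `b1` and `b2` are placeholder coefficient vectors with the same lengths as the outputs of `fir1(192,...)` and `fir1(256,...)` (193 and 257 taps), not the actual filter designs, and `filter` stands in for `dsp.FIRFilter`.

```matlab
% Placeholder coefficients (assumption): only their lengths matter here.
b1 = ones(1,193)/193;     % same length as fir1(192,0.8)
b2 = ones(1,257)/257;     % same length as fir1(256,0.2,'High')

x = (1:4096)';            % one input frame, as in the example

% The second output of filter is the final state vector. For an FIR filter
% its length is numel(b) - 1, that is, the filter order.
[t,z1] = filter(b1,1,x);
[~,z2] = filter(b2,1,t);

stateLength = numel(z1) + numel(z2);
fprintf('Combined state length: %d samples\n',stateLength);
```

Each FIR filter has to remember as many past input samples as its order, which is why the two orders add up to the state length passed to `-s`.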

Again, measure the speedup for the resulting multithreaded MEX using the `dspunfoldBenchmarkFIRExample` function. Notice that the speedup increased because the exact state length was specified in samples, and dspunfold was able to subdivide the frame inputs.

`dspunfoldBenchmarkFIRExample;`
```
Speedup = 2.0x
```

Often, the speedup can be increased even more by increasing the repetition value (`-r`) provided when invoking `dspunfold`. The default repetition value is 1. When you increase this value, the multithreaded MEX buffers more frames internally before the processing starts. Increasing the repetition value increases the efficiency of the multithreading, but at the cost of a higher output latency.

```
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} ...
    -s 448 -f [true,false,false] -r 5
```

```
State length: 448 samples, Repetition: 5, Output latency: 60 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
```

Again, measure the speedup for the resulting multithreaded MEX, using the `dspunfoldBenchmarkFIRExample` function. Speedup increases again, but the output latency is now 60 frames. The general output latency formula is $2 \times \mathrm{Threads} \times \mathrm{Repetition}$ frames. In these examples, the number of `Threads` is equal to the number of physical CPU cores.

`dspunfoldBenchmarkFIRExample;`
```
Speedup = 2.2x
```
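The latency formula stated above can be checked against the values reported by `dspunfold` in this example (12 frames for a repetition of 1, 60 frames for a repetition of 5, both with 6 threads). The conversion to seconds in the sketch below is an added illustration: the sample rate of 44.1 kHz is an assumption and does not come from the example.

```matlab
threads     = 6;          % number of threads reported by dspunfold
repetition  = [1 5];      % the two repetition values used in this example
frameLength = 4096;       % samples per frame

% Output latency in frames: 2 x Threads x Repetition
latencyFrames = 2 * threads * repetition;

% Assumption: audio sampled at 44.1 kHz, used only to express the latency
% in seconds.
sampleRate     = 44100;
latencySeconds = latencyFrames * frameLength / sampleRate;

fprintf('Repetition %d: %d frames (%.2f s)\n', ...
    [repetition; latencyFrames; latencySeconds]);
```

Expressing the latency in seconds for the actual frame size and sample rate of an application helps decide how far the repetition value can be raised.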

#### Detecting State Length Automatically

To request that `dspunfold` autodetect the state length, specify `-s auto`. This option generates an efficient multithreaded MEX file, but with a significant increase in the generation time, due to the extra analysis that it requires.

```
dspunfold dspunfoldFIRExample -args {(1:4096)',firCoeffs1,firCoeffs2} ...
    -s auto -f [true,false,false] -r 5
```

```
State length: [autodetect] samples, Repetition: 5, Output latency: 60 frames, Threads: 6
Analyzing: dspunfoldFIRExample.m
Creating single-threaded MEX file: dspunfoldFIRExample_st.mexa64
Searching for minimal state length (this might take a while)
Checking stateless ... Insufficient
Checking 4096 samples ... Sufficient
Checking 2048 samples ... Sufficient
Checking 1024 samples ... Sufficient
Checking 512 samples ... Sufficient
Checking 256 samples ... Insufficient
Checking 384 samples ... Insufficient
Checking 448 samples ... Sufficient
Checking 416 samples ... Insufficient
Checking 432 samples ... Insufficient
Checking 440 samples ... Insufficient
Checking 444 samples ... Insufficient
Checking 446 samples ... Insufficient
Checking 447 samples ... Insufficient
Minimal state length is 448 samples
Creating multi-threaded MEX file: dspunfoldFIRExample_mt.mexa64
Creating analyzer file: dspunfoldFIRExample_analyzer.p
```

`dspunfold` checks different state lengths, using as inputs the values provided with the `-args` option. The function aims to find the minimum state length for which the outputs of the multithreaded MEX and single-threaded MEX are the same. Notice that it found a minimal state length of 448 samples, which matches the value computed manually earlier.
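The sequence of candidates in the log above is consistent with a bisection between 0 samples (stateless, insufficient) and 4096 samples (one frame, sufficient). The sketch below reproduces that sequence under this assumption. It is not the implementation used by `dspunfold`: the predicate `isSufficient` is a hypothetical stand-in for the real check, which compares the outputs of the multithreaded and single-threaded MEX files.

```matlab
% Hypothetical stand-in for "outputs match with this state length".
trueStateLength = 448;
isSufficient = @(s) s >= trueStateLength;

lo = 0;        % known insufficient (stateless)
hi = 4096;     % known sufficient (one full frame)
checked = [];  % candidates tested, in order

while hi - lo > 1
    mid = floor((lo + hi)/2);
    checked(end+1) = mid; %#ok<AGROW>
    if isSufficient(mid)
        hi = mid;      % mid works, so the minimum is mid or smaller
    else
        lo = mid;      % mid fails, so the minimum is larger than mid
    end
end

minimalStateLength = hi;
fprintf('Minimal state length is %d samples\n',minimalStateLength);
```

A bisection of this kind needs about $\log_2(4096) = 12$ checks after the two endpoint checks, and each check in the real tool involves comparing MEX outputs, which helps explain the longer generation time.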

#### Verify Generated Multithreaded MEX Using the Generated Analyzer

When you create a multithreaded MEX file using `dspunfold`, the single-threaded MEX file is also created, along with an analyzer function. For the stateful example in the previous section, the name of the analyzer is `dspunfoldFIRExample_analyzer`.

The goal of the analyzer is to provide a quick way to measure the speedup of the multithreaded MEX relative to the single-threaded MEX, and also to check if the outputs of the multithreaded MEX and single-threaded MEX match. Outputs usually do not match when an incorrect state length value is specified.

Execute the analyzer for the multithreaded MEX file, `dspunfoldFIRExample_mt`, generated previously using the `-s auto` option.

```
firCoeffs1_1 = fir1(192,0.8);
firCoeffs1_2 = fir1(192,0.7);
firCoeffs1_3 = fir1(192,0.6);
firCoeffs2_1 = fir1(256,0.2,'High');
firCoeffs2_2 = fir1(256,0.1,'High');
firCoeffs2_3 = fir1(256,0.3,'High');
dspunfoldFIRExample_analyzer((1:4096*3)',[firCoeffs1_1;firCoeffs1_2;firCoeffs1_3],...
    [firCoeffs2_1;firCoeffs2_2;firCoeffs2_3]);
```

```
Analyzing multi-threaded MEX file dspunfoldFIRExample_mt.mexa64.
For best results, please refrain from interacting with the computer and stop other processes until the analyzer is done.
Latency = 60 frames
Speedup = 2.4x
```

Each input to the analyzer corresponds to the inputs of the `dspunfoldFIRExample_mt` MEX file. Notice that the length (first dimension) of each input is greater than the expected length. For example, `dspunfoldFIRExample_mt` expects a frame of 4096 doubles for its first input, while $4096 \times 3$ samples were provided to `dspunfoldFIRExample_analyzer`. The analyzer interprets this input as 3 frames of 4096 samples. The analyzer alternates between these 3 input frames circularly while checking if the outputs of the multithreaded and single-threaded MEX files match.

The table shows the inputs used by the analyzer at each step of the numerical check. The total number of steps invoked by the analyzer is 180 or $3 \times \mathrm{latency}$, where $\mathrm{latency}$ is 60 in this case.

| Step | input1 | input2 | input3 |
|------|--------|--------|--------|
| Step1 | `(1:4096)'` | `firCoeffs1_1` | `firCoeffs2_1` |
| Step2 | `(4097:8192)'` | `firCoeffs1_2` | `firCoeffs2_2` |
| Step3 | `(8193:12288)'` | `firCoeffs1_3` | `firCoeffs2_3` |
| Step4 | `(1:4096)'` | `firCoeffs1_1` | `firCoeffs2_1` |
| ... | ... | ... | ... |
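The circular selection of frames shown in the table can be sketched as follows. This is an illustration of the indexing pattern only, written under the assumption that the analyzer splits its first input into consecutive frames of 4096 samples. It does not call the analyzer or the MEX files, and the variable names are hypothetical.

```matlab
frameLength = 4096;
numFrames   = 3;                          % frames supplied to the analyzer
allSamples  = (1:frameLength*numFrames)'; % same first input as in the example

% One column per frame: column k holds the k-th block of 4096 samples.
frames = reshape(allSamples,frameLength,numFrames);

latency  = 60;                            % frames, as reported above
numSteps = 3 * latency;                   % 180 steps in total

firstSample = zeros(numSteps,1);
for step = 1:numSteps
    idx = mod(step-1,numFrames) + 1;      % cycles 1,2,3,1,2,3,...
    currentFrame = frames(:,idx);
    firstSample(step) = currentFrame(1);
end

% First sample of the frame used in steps 1 to 4: 1, 4097, 8193, 1
disp(firstSample(1:4)');
```

The coefficient inputs would be selected with the same index, taking row `idx` of each stacked coefficient matrix.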

NOTE: For the analyzer to correctly check for the numerical match between the multithreaded MEX and single-threaded MEX, provide at least two frames with different values for each input. For inputs that represent parameters, such as filter coefficients, the frames can have the same values for each input. In this example, you could have specified a single set of coefficients for the second and third inputs.