# Distributing individual tasks over my pool of workers, scaling should be perfect

5 views (last 30 days)
Alexis Bottero on 16 Nov 2020
Commented: Edric Ellis on 19 Nov 2020
Here is a parallel computing problem that should be trivial: distributing individual tasks (of equivalent duration) over a pool of workers.
parpool(4) % Open a pool of 4 workers
Ntasks = 4; % Number of tasks to compute (chosen equal to the number of workers for simplicity)
N = 2000; % Big number (not too big, of course, to avoid memory issues)
%% First strategy: serial computing
big_matrix_or_whatever = zeros(N,N);
tic
for i = 1:Ntasks
    % norm(pinv(...)) is a dummy example of a costly task
    complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
toc % ~ 2.6 s
%% Second strategy: parfor
big_matrix_or_whatever = zeros(N,N);
tic
parfor i = 1:Ntasks
    % norm(pinv(...)) is a dummy example of a costly task
    complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
toc % ~ 8 s
%% Third strategy: drange
big_matrix_or_whatever = zeros(N,N);
tic
for i = drange(1:Ntasks)
    % norm(pinv(...)) is a dummy example of a costly task
    complicated_thing_to_compute(i) = norm(pinv(big_matrix_or_whatever));
end
res = gather(complicated_thing_to_compute, 1);
toc % ~ 2.7 s
%% Correct strategy that runs in ~ 0.7 s? Maybe with spmd?
Note that this is a trivial problem to tackle with MPI: each proc receives a list of indices 'i' (just one per proc in this example) and does its thing independently.
Coded in MPI (I've done it in Python, C and Fortran) the scaling is perfect: computational time is exactly inversely proportional to the number of workers. There is absolutely no reason why it should be any different. Of course, in the end (if needed) everybody has to send their results to everybody, and this can take a bit of additional time. That should be negligible in this specific example though (procs just exchange a few floats).

Sign in to comment.

### Answers (2)

Walter Roberson on 16 Nov 2020
Note that this is a trivial problem to tackle with MPI: each proc receives a list of indices 'i' (just one per proc in this example) and does its thing independently.
Coded in MPI (I've done it in Python, C and Fortran) the scaling is perfect: computational time is exactly inversely proportional to the number of workers.
You are assuming that each thread is taking the same amount of time (to within statistical noise) so that you can pre-determine the indices to send for processing. parfor() does not assume that: it allocates roughly 2/3 of the tasks the first time around, then it allocates most of the rest in fixed-sized (but smaller) chunks, and then it runs down the remaining tasks individually. This way, if some threads take different lengths of time, CPUs do not go idle until there is no remaining work for them.
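As an illustrative sketch of that scheduling point: since R2019b the chunking behaviour can be overridden with parforOptions, forcing fixed-size subranges instead of the default adaptive partitioning (the loop body here is just a placeholder):

```matlab
% Sketch, assuming an open pool and R2019b or later (parforOptions).
pool = gcp();
% Force fixed-size chunks of one iteration each, instead of the
% default adaptive "big chunk first, then smaller chunks" scheme.
opts = parforOptions(pool, "RangePartitionMethod", "fixed", ...
                     "SubrangeSize", 1);
parfor (i = 1:4, opts)
    % each worker receives exactly one iteration per chunk
end
```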
Anyhow, what you are probably looking for is threads-based parfor, which is available as of R2020b, the release after what you have.
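A minimal sketch of what the thread-based pool looks like (R2020b or later; the matrix and loop bounds just mirror the question's example):

```matlab
% Sketch, assuming R2020b+: thread workers share memory with the
% client, so there is no per-worker copy of the big matrix.
pool = parpool("threads");
big_matrix_or_whatever = zeros(2000);
parfor i = 1:4
    result(i) = norm(pinv(big_matrix_or_whatever));
end
delete(pool)
```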

#### 3 Comments

Alexis Bottero on 17 Nov 2020
Thank you for your answer. You are right, I forgot to mention that in this simple example the individual tasks take roughly the same amount of time; I've corrected this.
If I understand your answer correctly, you are telling me that getting perfect scaling for this kind of perfectly-independent setting is impossible with MATLAB versions < R2020b?
This is terrible news. On the problem we are working on, running on my 32-core desktop machine (this is 2020, come on), the python+mpi version runs in 0h45 while the MATLAB version can't run in less than a day... Crazy. I hope R2020b solves this issue.
Here is the code in MPI, MATLAB-style; it is trivial:
rank = MPI.COMM_WORLD.Get_rank(); % Rank of current process
size = MPI.COMM_WORLD.Get_size(); % Number of processes
for i = 1:Ntasks
    if mod(i, size) ~= rank
        continue
    end
    complicated_thing_to_compute(i) = f(i); % Function to compute (whatever it is)
end
It seems to me absolutely essential for the future to include support for MPI in Matlab.
Anyway, I am going to update to R2020b and test thread-based parfor. Fingers crossed.
Walter Roberson on 17 Nov 2020
I indicated that spmd is MPI based. It is, but remember that MPI passes messages between processes without any certainty that shared memory can be used, even for the message passing. Shared memory for message passing is an optimization as far as MPI is concerned. And you have to get the big array to the processes, which again is not certain to use shared memory.
How does MATLAB transfer initial data to the workers for the parallel technologies that are not spmd? That is not specified, and it might differ from the transfer used for spmd. It is specified that the process is equivalent to calling save(), transferring the byte stream, and calling load(), which rules out sharing memory with the original objects, but the data transfer itself could happen in different ways.
I did trace through the data transfer for parallel queues at one point. It involved using Java queues. I do not recall at the moment how the Java queues transferred data.
Anyhow, data transfer costs add up.
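As a sketch of how to see those costs: ticBytes and tocBytes report how many bytes the pool transfers to and from each worker during a parfor loop (pool and sizes here are illustrative):

```matlab
% Sketch, assuming an open process pool: measure parfor data transfer.
pool = gcp();
A = zeros(2000);
ticBytes(pool);
parfor i = 1:4
    out(i) = norm(pinv(A));  % A is broadcast to every worker
end
tocBytes(pool) % bytes sent to / received from each worker
```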

Sign in to comment.

Edric Ellis on 17 Nov 2020
There are several confounding factors here that go some way at least to explaining your timings.
In your first example (the for loop running in the client), the call to MATLAB's pinv is intrinsically multithreaded - i.e. MATLAB automatically runs it on multiple threads. That means that all your computational resources are already being fully utilised in running the loop. There's nothing more to give from your computer, and the only way you're going to get a faster time is by getting more hardware.
This (somewhat) explains why your parfor loop is never going to be faster. The fact that it is slower is a little confusing. There are several possibilities here. The first thing to consider is how many hardware threads your CPU supports; MATLAB's maxNumCompThreads can tell you that. If the answer is not 4, then you are probably not seeing the best performance. This is because (by default) parallel pool workers operate in "single computational thread" mode - so they will not intrinsically multi-thread the calls to pinv. There's also overhead in getting your large-ish matrix to the workers, and that will slow things down.
Walter's comment about the parfor scheduling is not quite right in this specific case (the comment is correct in general) - when the loop bounds happen to match the number of workers in the pool, exactly one iteration is sent to each worker. So that aspect isn't slowing you down here. All that said, I am surprised at how poorly the parfor loop is performing in this case. I would have expected it to be only slightly slower than the for loop.
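A quick sketch of that check (variable names are illustrative):

```matlab
% Sketch: compare the client's computational thread count with the
% pool size. If nThreads > nWorkers, the single-threaded workers
% together use fewer cores than the multithreaded client does.
nThreads = maxNumCompThreads  % threads the client uses for pinv etc.
pool = gcp();
nWorkers = pool.NumWorkers    % workers are single-threaded by default
```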
Finally, your for-drange example isn't right - you're running on the client there, so that's why you're seeing the same time as the client. You need to put the loop inside an spmd block, i.e.
spmd
    for i = drange(1:Ntasks)
        % do stuff
    end
end

#### 2 Comments

Alexis Bottero on 18 Nov 2020
Thank you for your answer and for pointing out that I had forgotten spmd. Everything makes more sense now and I can refine my question: how can I use the serial implementation of a function like pinv (not the parallel one) to improve the efficiency of parfor/spmd? If my intuition is correct, this would solve the problem.
Let me illustrate my point and give more details.
I had just picked pinv as an example of a costly function but it turned out to be a very good example actually. Let's look at the python code below:
from numpy import random
from numpy.linalg import pinv, norm
from time import time
N = 2000
big_matrix = random.rand(N, N)
tic = time()
complicated_thing_to_compute = norm(pinv(big_matrix))
toc = time()
print("Elapsed:", toc - tic)
This returns in ~0.8s on one proc of my machine. MATLAB equivalent:
N = 2000;
big_matrix = rand(N);
tic
complicated_thing_to_compute = norm(pinv(big_matrix));
toc
Returns in ~1.0s as well, using (as you say) all the available resources (36 procs). I suppose the parallel version of pinv is not very efficient, because the two languages usually compare well in terms of performance. This is completely OK; computing a pseudo-inverse efficiently in parallel is an open problem.
Problem is: if one has many pinv to compute it makes much more sense to parallelize differently, meaning: send one pinv to each proc instead of using all of them poorly to compute each single pinv.
Now a test case to quantify the difference in terms of performance: suppose one has e.g. 1000 pinv of 2000x2000 matrices to compute on a 36-core machine. This takes ~half an hour using python+mpi (I have the benchmark code, if interested) but would take more than 20 hours in MATLAB on the same machine (if, of course, it is not possible to disable the parallel implementation of pinv). And the gap gets worse and worse as the machine gets bigger, making MATLAB completely obsolete.
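One MATLAB-side sketch of that "send one pinv to each proc" pattern is parfeval, which queues each task independently on the pool's (by default single-threaded) workers; the sizes and anonymous function here are illustrative, not a benchmark:

```matlab
% Sketch, assuming a process pool is open: queue independent tasks.
Ntasks = 1000;
A = rand(2000);
for k = Ntasks:-1:1            % count down to preallocate the array
    futures(k) = parfeval(@(M) norm(pinv(M)), 1, A);
end
results = fetchOutputs(futures); % blocks until all tasks finish
```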
Simple piece of code to illustrate:
Ntasks = 4; % Number of tasks to compute
Nworkers = Ntasks; % Number of workers to use
parpool(Nworkers); % Start the pool
task_type = 1; % Does not work! The parallelization is useless
%task_type = 2; % Works like a charm! Parallelization is efficient
tic
for n = 1:Ntasks
    res(n) = complicated_task(n, task_type);
end
toc % Computational time in serial
tic
spmd
    for n = drange(1:Ntasks)
        res(n) = complicated_task(n, task_type);
    end
end
toc % -> Computational time should be ~Ntasks times quicker than above
% Example of complicated task of two different types
function f = complicated_task(n, task_type)
if task_type == 1 % Task relying on a function with a parallel implementation
    N = 2000;
    f = norm(pinv(rand(N+n)));
else % Task with no function written in parallel
    N = 20000;
    f = 0;
    for i = 1:N
        for j = 1:N
            f = f + n*(i+j);
        end
    end
end
end
If my intuition is correct, disabling the parallel implementation for all the functions in use when task_type == 1 will improve the efficiency big time.
Edric Ellis on 19 Nov 2020
I'd like to focus on one piece of your comment there. You said:
if, of course, it is not possible to disable parallel implementation of pinv
The workers running your parfor loop do have the parallel implementation disabled (by default) - as I mentioned, with default settings, workers run in "single computational thread" mode.
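As a side note, a sketch assuming R2020a or later: the local cluster's NumThreads property can give each process worker more than one computational thread, if you ever do want multithreaded pinv on the workers:

```matlab
% Sketch, assuming R2020a+ and a 32-core machine (numbers illustrative).
c = parcluster("local");
c.NumThreads = 2;       % each worker may use 2 computational threads
pool = parpool(c, 16);  % 16 workers x 2 threads = 32 cores
```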
So, I do think parfor is a viable option when you have e.g. 1000 pinv-type tasks. Here's a simple experiment I ran on the largest machine I could find around here - a 32-core machine.
% Launch a pool with one worker per "real" core.
parpool(32);
% Input data
A = rand(2000);
% Run 10 times locally.
t = tic();
for idx = 1:10
    norm(pinv(A));
end
serialTime1 = toc(t) / 10
% Run 10 times in parfor.
t = tic();
parfor idx = 1:10
    norm(pinv(A));
end
parallelTime1 = toc(t) / 10
On the machine I tried, I got 1.2 seconds for serialTime1, and 0.6 seconds for parallelTime1. This shows that on this machine, I can get twice as much work done per unit wall-clock time using parfor.

Sign in to comment.

R2020a
