Parallel is slower than sequential?

Question

Viviana Arrigoni, 19 April 2018


Answered: Edric Ellis, 19 April 2018

I am new to the Parallel Computing Toolbox, and I have many doubts. I implemented a parallel Jacobi algorithm, and it turned out to be slower than the sequential version when using the same precision threshold parameters. I tried several parallel approaches, and none was fast enough. So I tried some simpler code, such as the one below:

     tic;
     ticBytes(gcp);
     n = 500;
     n_mat = 50;
     C = cell(1, n_mat);
     parfor i = 1:n_mat
         A = rand(n);
         B = rand(n);
         C{i} = A * B;
     end
     tocBytes(gcp);
     toc

and it is slower than the same code with 'for' instead of 'parfor'. I got, respectively:

             BytesSentToWorkers    BytesReceivedFromWorkers
             __________________    ________________________

    1              16016                  5.2018e+07       
    2              18152                  4.8021e+07       
    Total          34168                  1.0004e+08

Elapsed time is 1.590726 seconds.

for the parallel version,

and: Elapsed time is 0.674556 seconds.

for the sequential version.

What am I doing wrong? I also don't really understand what sliced variables are. Furthermore, I noticed that using cell arrays instead of numeric arrays inside parfor doesn't trigger the overhead warning, so I have tended to prefer them, yet things usually run faster with numeric arrays.


Answer 1

Edric Ellis, 19 April 2018


There are a couple of reasons why your parfor loop is slower than the equivalent for loop. Firstly, there's the data transfer overhead: you're transferring a fair amount of data from the workers back to the client (about 100 MB in total, according to your tocBytes output). This data has to be serialized on the worker (basically like calling save on the data, but without using a file), sent to the client, and then deserialized (the equivalent of load).
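To make the transfer cost concrete, here is a rough sketch rather than a definitive fix. The first part estimates the size of the results from the numbers in the question, assuming 8 bytes per double. The second part is a hypothetical variant (the variable `s` and the choice of the Frobenius norm are illustrative assumptions, not part of the original problem) that returns one scalar per iteration, so you can compare what tocBytes reports when very little has to be sent back.

```matlab
% Estimate the result payload for the loop in the question.
n = 500;                 % matrix dimension, from the question
n_mat = 50;              % number of matrices, from the question
bytesPerDouble = 8;      % rand returns double-precision values
expectedBytes = n_mat * n^2 * bytesPerDouble;
fprintf('Expected result payload: %.4e bytes\n', expectedBytes);

% Hypothetical variant: return one scalar per iteration instead of a full
% n-by-n matrix, so far less data has to be serialized back to the client.
pool = gcp;              % starts a pool if none is open
ticBytes(pool);
s = zeros(1, n_mat);
parfor i = 1:n_mat
    A = rand(n);
    B = rand(n);
    s(i) = norm(A * B, 'fro');   % s is indexed by the loop variable (sliced output)
end
tocBytes(pool);
```

The estimate lines up with the BytesReceivedFromWorkers total in the question. Whether a reduction like this is usable depends entirely on whether the full matrices are needed on the client afterwards.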

Secondly, and probably most importantly in this case, if you're using only the local cluster type, then unfortunately this particular loop is pretty much guaranteed to be slower with parfor than with for. That's because the for loop version is already efficiently multi-threaded through mtimes (the function behind A * B), so it is essentially already taking full advantage of all the cores on your computer. Workers in a parfor loop default to single-threaded mode to avoid overloading your computer, so each individual call to mtimes is slower on a worker.
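One way to see the effect of implicit multi-threading on your own machine is to time the plain for loop twice, once with the default thread count and once with the client restricted to a single computational thread via maxNumCompThreads. This is only an illustrative sketch: the variable names are assumptions, and the timings will vary with hardware and MATLAB version.

```matlab
n = 500;        % matrix dimension, from the question
n_mat = 50;     % number of matrices, from the question
C = cell(1, n_mat);

% Baseline: plain for loop with the default (multi-threaded) mtimes.
tic;
for i = 1:n_mat
    C{i} = rand(n) * rand(n);
end
tMulti = toc;

% Restrict the client to one computational thread, which mimics the
% default single-threaded mode of parfor workers.
oldThreads = maxNumCompThreads(1);   % returns the previous setting
tic;
for i = 1:n_mat
    C{i} = rand(n) * rand(n);
end
tSingle = toc;
maxNumCompThreads(oldThreads);       % restore the previous thread count

fprintf('for, default threads: %.3f s\n', tMulti);
fprintf('for, single thread:   %.3f s\n', tSingle);
```

If the single-threaded run is noticeably slower, that gap is roughly what each worker gives up per call to mtimes, before any data transfer is counted.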