Loops in parfor are unexpectedly slow in very simple code

2 views (last 30 days)
Arabarra on 18 Feb 2020
Commented: Arabarra on 19 Feb 2020
Hi,
I'm struggling to understand why a piece of code that I wrote scales so poorly when run in parallel using parpool.
The operation that I want to run in parallel is in the function unitLinear:
function unitLinear(L,N)
% Decoy workload: repeatedly form the element-wise product of two LxLxL arrays.
a = rand(L,L,L);
b = rand(L,L,L);
for i = 1:N
    for j = 1:N
        c = a.*b;
    end
end
This does nothing useful; it is just a decoy to measure execution performance. (If you are curious, it models the first step of Principal Component Analysis of a set of N volumes, each with LxLxL pixels, by computing all pairwise correlations.)
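For the curious, a rough sketch of the real first step, under the assumption that the N volumes are stacked as columns of a hypothetical L*L*L-by-N matrix vols (corrcoef is standard MATLAB and correlates the columns pairwise):
vols = rand(L*L*L, N);   % hypothetical stack: one vectorized volume per column
C = corrcoef(vols);      % N-by-N matrix of all pairwise correlations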
My approach to running this function in parallel is the following script, tryUnitLinear:
nTasks = 40;  % number of tasks to be executed in parallel
L = 128;      % cube side length in pixels (inside testing units)
N = 100;      % length of interior loop inside testing unit
fprintf('Results for L:%d N:%d \n',L,N);
% time the unit on its own first
disp('Computing testing unit in single core');
tInitialUnit = clock();
unitLinear(L,N);
timeUnitSingle = etime(clock(),tInitialUnit);
fprintf('Testing unit in single core: %5.2f \n',timeUnitSingle);
tUnitArray = zeros(nTasks,1); % to store the time seen inside the loop
%tUnitArray = distributed(tUnitArray);
t1 = clock();
parfor i = 1:nTasks
    tInitialUnit = clock();
    unitLinear(L,N);
    timeUnit = etime(clock(),tInitialUnit);
    tUnitArray(i) = timeUnit;
    fprintf('Testing unit %d time %5.2f \n',i,timeUnit);
end
tTotal = etime(clock(),t1);
fprintf('Total time: %5.2f \n',tTotal);
fprintf('Sum process time: %5.2f \n',sum(tUnitArray));
fprintf('Average process time: %5.2f \n',sum(tUnitArray)/nTasks);
fprintf('Unit in single core: %5.2f \n',timeUnitSingle);
When run outside the parfor, the first execution of unitLinear took about 10 seconds on my system... and inside the parfor loop (using a parpool opened with the local profile), each execution indeed reported about 10 seconds. So far so good. Since I was working with an open pool of 16 workers (as seen here):
>> a = gcp
a =
Pool with properties:
Connected: true
NumWorkers: 16
Cluster: local
AttachedFiles: {}
AutoAddClientPath: true
IdleTimeout: 30 minutes
SpmdEnabled: false
.... I was expecting the execution of the parfor loop to take about (10 seconds per task x 40 tasks) / 16 workers = approx. 25 seconds. However, the wall-clock time was around 400 seconds! As if no parallelization were taking place at all! Even though htop was reporting all requested cores working (I have no other processes running on this machine)... and the fans were in fact busy as hell.
This is something I didn't expect. I know that a clean 16x speedup would be asking too much, but a speedup of 1x on 16 local cores is too unexpected, as the parfor loop couldn't be simpler. No files, no shared variables... nothing I can think of... or am I missing something obvious? My main problem is that I cannot tell whether this is somehow expected behavior or the symptom of something going terribly wrong in my system...
Any help is welcome!
I paste here the results of executing the code:
>> tryUnitLinear
Results for L:128 N:100
Computing testing unit in single core
Testing unit in single core: 9.83
Testing unit 40 time 9.77
Testing unit 39 time 9.68
Testing unit 38 time 9.47
Testing unit 37 time 9.57
Testing unit 36 time 9.83
Testing unit 35 time 9.67
Testing unit 34 time 10.01
Testing unit 33 time 9.62
Testing unit 32 time 10.54
Testing unit 31 time 10.40
Testing unit 30 time 11.69
Testing unit 29 time 9.54
Testing unit 28 time 9.51
Testing unit 27 time 9.59
Testing unit 26 time 9.80
Testing unit 25 time 10.23
Testing unit 24 time 10.14
Testing unit 23 time 10.17
Testing unit 22 time 9.37
Testing unit 21 time 9.33
Testing unit 20 time 10.05
Testing unit 19 time 11.25
Testing unit 18 time 10.37
Testing unit 17 time 9.73
Testing unit 16 time 9.95
Testing unit 15 time 10.30
Testing unit 14 time 9.41
Testing unit 13 time 10.47
Testing unit 12 time 10.27
Testing unit 11 time 9.52
Testing unit 10 time 9.67
Testing unit 9 time 9.73
Testing unit 8 time 12.17
Testing unit 7 time 9.96
Testing unit 6 time 10.17
Testing unit 5 time 9.81
Testing unit 4 time 9.92
Testing unit 3 time 9.33
Testing unit 2 time 9.35
Testing unit 1 time 9.70
Total time: 469.00
Sum process time: 399.08
Average process time: 9.98
Unit in single core: 9.83
(By the way, I find it rather weird that the parfor visits i in exactly the reverse order of the integers (I was expecting a totally random access pattern), but I cannot tell whether that has any relationship to the problem I described.)
thanks in advance!
Daniel

Accepted Answer

Jacob Wood on 18 Feb 2020
MATLAB actually multithreads element-wise multiplication, so all available cores are already in use in the "single core" case, and the parfor implementation adds no additional performance. See this link for more information:
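A minimal way to check this claim, sketched with documented MATLAB functions (unitLinear is the function from the question; maxNumCompThreads temporarily caps the client's computational threads and returns the previous setting):
nOld = maxNumCompThreads(1);      % force a single computational thread
tic; unitLinear(128,100); tOne = toc;
maxNumCompThreads(nOld);          % restore the previous thread count
tic; unitLinear(128,100); tAll = toc;
fprintf('1 thread: %.2f s, %d threads: %.2f s \n', tOne, nOld, tAll);
If both timings are similar, implicit multithreading is not what is eating the cores; if the one-thread run is much slower, the "single core" baseline was never really single-core. Note also that parpool workers run a single computational thread by default, so a multithreaded client baseline is not a fair yardstick for per-worker speed.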
3 Comments
Arabarra on 19 Feb 2020
Yes, multithreading is definitely not the issue. In the unit function I replaced the element-wise multiplication with an explicit loop (see below) and I get the same results...
function unitLinear(L,N)
a = rand(L,L,L);
b = rand(L,L,L);
c = zeros(L,L,L);
for i = 1:N
    for j = 1:N
        %c = a.*b;
        for k = 1:L
            for m = 1:L
                c(k,m) = a(k,m) + b(k,m); % explicit element-wise work
            end
        end
    end
end
Arabarra on 19 Feb 2020
Update: I restarted the computer and you were right: htop now shows me that the matrix multiplication is multithreading like hell. Well, not the result I needed (as it means I cannot gain computing time :-(), but at least I know why... thanks!
Daniel


More Answers (0)
