Why does performance of functions saturate with number of cores using parfeval but not with parfor?

5 views (last 30 days)
I am developing an application that MUST take advantage of parallelization and, ideally, offer real-time updates after each iteration, which makes parfeval preferable. I believe the algorithm I have developed is highly parallelizable (see attached for the performance of 'WT_Ex_2_b' as a function of the number of cores used with parfeval). From 1 to 8 cores the speedup factor agrees with theoretical expectation (Amdahl's Law with p = 0.95); however, the performance of my application saturates at 8 cores. This led me to create a dummy function (see attached script) to compare the performance of parfor and parfeval as a function of the number of cores. I found that the parfor version closely follows theoretical expectation (again Amdahl's Law with p = 0.95), whereas the parfeval version shows the same strange saturation behavior, even for the dummy function. Notice how the speedup factor improves with core count up to 12 cores, then suddenly no further improvement is observed. I have attached the script in case you want to reproduce this behavior on your end.
Is there a fundamental limit on the number of cores that parfeval can leverage? Or is there an obvious mistake in the way I am using parfeval? Why does the performance of the dummy algorithm suddenly saturate at 12 cores? Any recommendations on how to make parfeval perform as well as parfor?
I would like to emphasize that I have already developed my application to use parfeval, so converting to parfor would be time-consuming and prevent me from utilizing the update-after-iteration feature of parfeval.
Thank you for your help on this critical matter.
4 comments
Rik on 30 Jun 2020
I'm not sure what people could do with it, but I think I would redact that license number. I'm on mobile now, so it's a pain to edit it away for you.


Accepted Answer

Edric Ellis on 1 Jul 2020
The main difference between parfor and parfeval is that with parfeval you are responsible for scheduling the work on the workers. parfor has an advantage here because it knows how many loop iterations there are, so it schedules a fixed number of chunks of work per worker (see the documentation for parforOptions - the chunks are referred to as "sub-ranges"). In your case, parfeval incurs more overhead because each parfeval request is sent to a worker on its own, whereas parfor groups iterations together; grouping is generally more efficient when the duration of each request is comparable to the overhead of making a single remote request.
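To make the chunking concrete, here is a minimal sketch (the work function, loop length, and chunk size are placeholders, and it assumes the default 'local' profile):
% Sketch: control the size of the "sub-ranges" parfor hands to each worker.
% someExpensiveFunction and the SubrangeSize of 50 are illustrative only.
c = parcluster('local');
opts = parforOptions(c, 'RangePartitionMethod', 'fixed', 'SubrangeSize', 50);
out = zeros(1, 1000);
parfor (i = 1:1000, opts)
    out(i) = someExpensiveFunction(i);   % each worker receives 50 iterations at a time
end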
So parfeval does not have a fundamental limitation in this regard, but you may need to amalgamate your requests if they are too short, in order to match parfor performance. Another option is to use parfor together with a DataQueue, which lets you perform updates at the client after each parfor iteration completes.
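If per-iteration client updates are the main reason for preferring parfeval, a minimal sketch of the parfor + DataQueue approach might look like this (the work function and the update callback are placeholders):
% Sketch: parfor with a DataQueue so the client reacts after every iteration.
q = parallel.pool.DataQueue;
afterEach(q, @(i) fprintf('Finished iteration %d\n', i));  % runs on the client
N = 100;
result = zeros(1, N);
parfor i = 1:N
    result(i) = someExpensiveFunction(i);   % illustrative work function
    send(q, i);                             % notify the client that this iteration is done
end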
8 comments
Joseph Smalley on 13 Jul 2020
Edric, after taking a break and returning to this problem, the code you last recommended above still behaves in the non-physical way described in my previous comment. The disp() line within my parfor loop (code below) checks a property of both the W_temp and W objects. All values of "Src Intensity" should be equal to ~1. However, once the iteration count exceeds a multiple of the number of workers in my pool (6), the properties show a "step-like" behavior: one of the workers sees an updated object rather than the original object (from before the parfor loop began). Below, Src Intensity jumps to 2 on the 7th iteration to complete and to 3 on the 13th.
N=7
Iteration #3 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #2 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #7 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 7 complete. Src Intensity(temp)=2.0002, Src Intensity(main)=2.0002
----
N=13
Iteration #2 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #3 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #8 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #11 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #10 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #9 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #7 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #13 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #12 of 13 complete. Src Intensity(temp)=3.0001, Src Intensity(main)=3.0001
It then appears that W is updated, WITHIN the parfor loop, after each completed cycle of 6 workers. However, upon completion of the parfor loop, only the W_temp object is updated, so I need to "manually" update the properties of W with a serial for loop, which is OK. The problem is that I do not want W_temp or W to be updated within the parfor loop after completion of a multiple of the parallel pool size. I want all workers to see the original W object for all iterations. Is this possible? Thank you for your continued assistance.
% Pre-allocation
maxSegPerRay = W.maxSegments*W.maxBranches;
rayListAll_origin = zeros(3,N,maxSegPerRay);
rayList_length = zeros(N,1);
% W is the handle object whose properties include other handle classes that we want to update
W_temp(N,1) = W;
for i = 1:N
    W_temp(i) = W;
end
parfor i = 1:N
    % Convert broadcast variable into temporary variable
    W_temp2 = W;
    % Call main function
    [~,rayList] = IterTrace_oneParent_par(W_temp2,inputRays(i));
    % Convert updated rayList properties to numeric array (not a problem)
    rayList_length(i) = length(rayList);
    zeroList_length = maxSegPerRay - rayList_length(i);
    rayListAll_origin(:,i,:) = [rayList.origin, zeros(3,zeroList_length)];
    % Update W_temp object and display Src Intensity of W_temp and W (should always be ~1)
    W_temp(i) = W_temp2;
    disp(['Iteration #', num2str(i), ' of ', num2str(N), ' complete. Src Intensity(temp)=', num2str(sum([W_temp(i).objects{5}.rays.intensity])), ...
        ', Src Intensity(main)=', num2str(sum([W.objects{5}.rays.intensity]))]);
end
% Note: W_temp and W are both updated WITHIN the parfor loop after a subRange is complete, but only W_temp is updated on completion of the parfor loop
% "Manually" update properties of detector objects contained in W
for i = 1:N
    for j = 1:W.numObj
        if class(W.objects{j}) == "Detector"
            if ~isempty(W_temp(i).objects{j}.rays)
                W.objects{j}.rays = W_temp(i).objects{j}.rays;
            end
        end
    end
end
Joseph Smalley on 11 Aug 2020
Just wanted to say that I accepted this answer because, overall, the problem is addressed more easily by switching to a parfor loop, as Edric first proposed. Additionally, I switched all my classes from handle classes to value classes. The latter is a compromise for my application, and was first motivated by the requirements of codegen for MEX files. Nonetheless, the combination of parfor with value classes has been working for several weeks now, with reasonably good scalability of about 12x at 24 cores.
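For reference, a minimal sketch of the handle vs. value distinction that was biting me (class names are made up; each classdef would live in its own file):
% HandleSrc.m (handle class): assignment copies a reference
%   classdef HandleSrc < handle
%       properties
%           intensity = 1;
%       end
%   end
%
% ValueSrc.m (value class): assignment copies the data
%   classdef ValueSrc
%       properties
%           intensity = 1;
%       end
%   end
h = HandleSrc;
v = ValueSrc;
h2 = h;  h2.intensity = 2;   % h.intensity is now 2 as well (same underlying object)
v2 = v;  v2.intensity = 2;   % v.intensity is still 1 (v2 is an independent copy)
Inside parfor, each worker keeps its own copy of a broadcast variable for the duration of the loop, so changes made to a handle object persist across the iterations that worker executes, which is consistent with the step-like pattern above; with value classes every assignment makes an independent copy, so the broadcast object is never modified.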



Release: R2020a
