MATLAB Answers

Why does performance of functions saturate with number of cores using parfeval but not with parfor?

Joseph Smalley on 30 June 2020
Commented: Joseph Smalley on 11 August 2020
I am developing an application that MUST take advantage of parallelization and should ideally offer real-time updates after each iteration, which makes parfeval preferable. I believe the algorithm I have developed is highly parallelizable (see the attachment for the performance of 'WT_Ex_2_b' as a function of the number of cores used with parfeval). From 1 to 8 cores, the speedup factor agrees with the theoretical expectation (Amdahl's Law with p=0.95); however, the performance of my application saturates at 8 cores. This led me to create a dummy function (see the attached script) to compare the performance of parfor and parfeval as a function of the number of cores. I discovered that the parfor version behaves much as theory predicts (Amdahl's Law, also with p=0.95), whereas the parfeval version still shows strange saturation behavior, even for the dummy function. Notice how the speedup factor improves with core count up to 12 cores, after which no further improvement is observed. I have attached the script in case you want to reproduce this behavior on your end.
Is there a fundamental limitation on the number of cores the parfeval function can leverage? Or is there an obvious mistake in the way I am using parfeval? Why does the performance of the dummy algorithm suddenly saturate at 12 cores? Do you have any recommendations on how to use parfeval so that it performs as well as parfor?
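For reference, the Amdahl's Law curve mentioned above can be written out explicitly. This is the standard textbook form, with $p$ the parallelizable fraction and $n$ the number of cores; the numerical values are simply the formula evaluated at $p = 0.95$ and are not measurements from the attached script.

```latex
S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}}, \qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - p}

% With p = 0.95:
S(8)  = \frac{1}{0.05 + 0.95/8}  \approx 5.93, \qquad
S(12) = \frac{1}{0.05 + 0.95/12} \approx 7.74, \qquad
S(32) = \frac{1}{0.05 + 0.95/32} \approx 12.55, \qquad
\lim_{n \to \infty} S(n) = 20
```

Under this model the speedup should keep rising past 12 cores, so a hard plateau at 8 or 12 cores points to something other than the serial fraction alone.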
I would like to emphasize that I have already developed my application to use parfeval, so converting to parfor would be time-consuming and prevent me from utilizing the update-after-iteration feature of parfeval.
Thank you for your help on this critical matter.

4 comments

Joseph Smalley on 30 June 2020
Hi Rik,
There are 32 physical cores and 64 threads. Here is the output of ver -support:
-----------------------------------------------------------------------------------------------------
MATLAB Version: 9.8.0.1323502 (R2020a)
MATLAB License Number: ___
Operating System: Microsoft Windows Server 2019 Datacenter Version 10.0 (Build 17763)
Java Version: Java 1.8.0_202-b08 with Oracle Corporation Java HotSpot(TM) 64-Bit Server VM mixed mode
-----------------------------------------------------------------------------------------------------
MATLAB Version 9.8 (R2020a) License __
Image Processing Toolbox Version 11.1 (R2020a) License __
MATLAB Compiler Version 8.0 (R2020a) License __
Parallel Computing Toolbox Version 7.2 (R2020a) License __
Rik on 30 June 2020
I'm not sure what people could do with it, but I think I would redact that license number. I'm on mobile now, so it's a pain to edit it away for you.

Accepted Answer

Edric Ellis on 1 July 2020
The main difference between parfor and parfeval is that in the parfeval case, you are responsible for scheduling the work on the workers. parfor has an advantage over parfeval in that it knows how many loop iterations there are, so it schedules a fixed number of chunks of work per worker (see the documentation for parforOptions, where the chunks are referred to as "sub-ranges"). In your case, parfeval will incur more overhead because each parfeval request is sent to a worker on its own, whereas parfor groups iterations together. Grouping is generally more efficient when the duration of each request is comparable to the overhead of making a single remote request.
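As a rough illustration of grouping work by hand, the sketch below submits a few batches per worker instead of one parfeval request per task, and still reports progress as each batch finishes. The names batchedParfevalDemo, runBatch and shortTask, the task count, and the factor of 4 batches per worker are all assumptions made for this example; they do not come from the attached script.

```matlab
function results = batchedParfevalDemo()
% Sketch: group many short tasks into fewer, larger parfeval requests.
% runBatch and shortTask are hypothetical stand-ins for the real work.
pool = gcp();                       % current pool; may start one, depending on preferences
nTasks = 1000;                      % assumed number of short work items
nBatches = 4 * pool.NumWorkers;     % assumed batch count; tune for your workload
edges = round(linspace(0, nTasks, nBatches + 1));

% Submit one request per batch rather than one per task
futures(1:nBatches) = parallel.FevalFuture;
for b = 1:nBatches
    idx = (edges(b) + 1):edges(b + 1);
    futures(b) = parfeval(pool, @runBatch, 1, idx);
end

% Collect batches in completion order and update the client after each one
results = zeros(1, nTasks);
for k = 1:nBatches
    [b, batchOut] = fetchNext(futures);
    results((edges(b) + 1):edges(b + 1)) = batchOut;
    fprintf('Batch %d finished (%d of %d)\n', b, k, nBatches);
end
end

function out = runBatch(idx)
% Run every task in this batch on a single worker
out = zeros(1, numel(idx));
for k = 1:numel(idx)
    out(k) = shortTask(idx(k));
end
end

function y = shortTask(i)
% Placeholder workload
y = sum(sin(1:1000) * i);
end
```

Larger batches reduce per-request overhead but make progress updates coarser, so the batch count is a trade-off to tune rather than a fixed rule.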
So, parfeval doesn't have a fundamental limitation in this regard, but you might need to amalgamate your requests if they are too short to match parfor performance. Another option might be to use parfor together with DataQueue which would let you perform updates at the client after each parfor iteration completes.
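The parfor plus DataQueue option might look roughly like the following minimal sketch. The loop body is a placeholder workload and the message format is an assumption; only parallel.pool.DataQueue, afterEach and send are the actual Parallel Computing Toolbox pieces being illustrated.

```matlab
% Sketch: per-iteration progress updates on the client from inside parfor.
N = 20;                                  % assumed number of iterations
q = parallel.pool.DataQueue;             % queue for worker-to-client messages

% This callback runs on the client each time a worker calls send
afterEach(q, @(i) fprintf('Iteration %d of %d complete\n', i, N));

out = zeros(1, N);
parfor i = 1:N
    out(i) = sum(svd(rand(50)));         % placeholder workload
    send(q, i);                          % notify the client that iteration i is done
end
```

Messages arrive in completion order, not loop order, so the callback should not assume that i is increasing.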

8 comments

Edric Ellis on 8 July 2020
As N exceeds the number of workers in the pool, the parfor machinery breaks the loop up into "subranges" to send to the workers. So, each worker will get a subrange of a number of loop iterations. objList here is a cell array, so I don't see how elements of that can interact with each other unless they're created that way. The other suspect here is the "broadcast" variable W. This will get copied once to each worker, and then the same instance will be used for multiple "subranges". So, if this has handle behaviour, that might explain it. Here's the sort of thing I'm thinking of.
h = containers.Map();
out = cell(1, 10);
parfor i = 1:10
    % Following line is to fool parfor into letting me modify 'h'
    hh = h;
    % Check the current contents of 'h'
    out{i} = hh.keys();
    % Modify 'h'
    hh(string(i)) = magic(i);
end
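If the goal is for every iteration to start from the original state, one possible approach is to make an independent copy of the handle object at the top of each iteration, so that modifications never reach the worker's broadcast instance. The sketch below does this for the containers.Map in the snippet above. The names copyPerIterationDemo and copyMap and the 'seed' key are hypothetical, introduced only for this example. A user-defined handle class would need its own deep-copy method, for instance by deriving from matlab.mixin.Copyable and customizing the copy for nested handle properties.

```matlab
function counts = copyPerIterationDemo()
% Sketch: give each parfor iteration its own copy of a handle object.
h = containers.Map();
h('seed') = 0;                  % hypothetical initial content
counts = zeros(1, 10);
parfor i = 1:10
    hh = copyMap(h);            % independent copy; 'h' itself is never modified
    counts(i) = double(hh.Count);   % number of keys seen at the start of the iteration
    hh(num2str(i)) = magic(i);      % changes only this iteration's copy
end
end

function c = copyMap(m)
% Element-by-element copy of a containers.Map into a new Map object
c = containers.Map('KeyType', m.KeyType, 'ValueType', m.ValueType);
k = m.keys();
for j = 1:numel(k)
    c(k{j}) = m(k{j});
end
end
```

Because each iteration works on its own copy, counts should be 1 for every iteration regardless of how the iterations are distributed over sub-ranges. Note that copyMap is shallow with respect to values: if the stored values were themselves handle objects, they would also need copying.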
Joseph Smalley on 13 July 2020
Edric, after taking a break and returning to this problem, I find that the code based on your last recommendation above still behaves in the non-physical way described in my previous comment. The disp() line within my parfor loop (code below) checks a property of both the W_temp and W objects. All values of "Src Intensity" should be ~1. However, once the iteration count exceeds a multiple of the number of workers in my pool (6), the property shows "step-like" behavior: a worker sees an updated object rather than the original object (as it was before the parfor loop began). The output below shows Src Intensity jumping to 2 on the 7th completed iteration and to 3 on the 13th.
N=7
Iteration #3 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #2 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #7 of 7 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 7 complete. Src Intensity(temp)=2.0002, Src Intensity(main)=2.0002
----
N=13
Iteration #2 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #3 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #5 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #4 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #1 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #6 of 13 complete. Src Intensity(temp)=1, Src Intensity(main)=1
Iteration #8 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #11 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #10 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #9 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #7 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #13 of 13 complete. Src Intensity(temp)=2, Src Intensity(main)=2
Iteration #12 of 13 complete. Src Intensity(temp)=3.0001, Src Intensity(main)=3.0001
It appears that W is updated WITHIN the parfor loop after each completed cycle of 6 workers. However, upon completion of the parfor loop, only the W_temp object is updated. Hence I need to "manually" update the properties of W with a serial for loop, which is OK. The problem is that I do not want W_temp or W to be updated within the parfor loop each time a number of iterations equal to the pool size has completed. I want all workers to see the original W object for all iterations. Is this possible? Thank you for your continued assistance.
% pre-allocation
maxSegPerRay = W.maxSegments*W.maxBranches;
rayListAll_origin = zeros(3,N,maxSegPerRay);
rayList_length = zeros(N,1);
% W is the handle object whose properties include other handle classes that we want to update
W_temp(N,1) = W;
for i=1:N
    W_temp(i) = W;
end
parfor i=1:N
    % Convert broadcast variable into temporary variable
    W_temp2 = W;
    % Call main function
    [~,rayList] = IterTrace_oneParent_par(W_temp2,inputRays(i));
    % Convert updated rayList properties to numeric array (not a problem)
    rayList_length(i) = length(rayList);
    zeroList_length = maxSegPerRay - rayList_length(i);
    rayListAll_origin(:,i,:) = [rayList.origin, zeros(3,zeroList_length)];
    % Update W_temp object and display Src Intensity of W_temp and W (should always be ~1)
    W_temp(i) = W_temp2;
    disp(['Iteration #', num2str(i), ' of ', num2str(N), ' complete. Src Intensity(temp)=', num2str(sum([W_temp(i).objects{5}.rays.intensity])),...
        ', Src Intensity(main)=', num2str(sum([W.objects{5}.rays.intensity]))]);
end
% Note: W_temp and W are both updated WITHIN the parfor loop after subRange is complete, but only W_temp is updated on completion of the parfor loop
% "Manually" update properties of detector objects contained in W
for i=1:N
    for j=1:W.numObj
        if class(W.objects{j})=="Detector"
            if ~isempty(W_temp(i).objects{j}.rays)
                W.objects{j}.rays = W_temp(i).objects{j}.rays;
            end
        end
    end
end
Joseph Smalley on 11 August 2020
Just wanted to say that I accepted this answer because, overall, the problem is addressed more easily by switching to a parfor loop, as Edric first proposed. Additionally, I switched all my classes from handle classes to value classes. The latter is a compromise for my application, and was first motivated by the requirements of codegen for MEX files. Nonetheless, the combination of parfor with value classes has been working for several weeks now, with pretty good scalability of 12x at 24 cores.


More Answers (0)
