Efficiency of for vs parfor loop for iterating a moderately large matrix operation
2 ビュー (過去 30 日間)
古いコメントを表示
Hi.
Running a program at the moment that takes several hours to complete. Roughly half of that time is spent on one line of code (repeated hundreds of thousands of times), which does multiplication, comparison, and addition of ~300x300 matrix (not sparse), and ~1x300 vectors.
I wanted to look into parallel-processing to speed up the program a bit. Technically the program is supposed to run linearly, where each iteration relies on output from the last, but I can do the for loop iterations in batches with the same initial values. However that's more of a follow-up question.
The question I have pertains to the results I'm getting with using a for loop, vs a parfor loop for essentially just executing this one line of code several thousand times.
The test code I have below runs two for-loops, and records the time to execute each forloop. The second for loop runs the mathematical process in one line. The first for loop runs the same algorithm, but separates it into four smaller operations, and then combining them together to complete the algorithm. Each forloop runs 10,000 times. This is repeated 20 times, and the mean time required for both for-loops is recorded. Then the program is run with "parfor" replacing the for loops, with the core limit set to 1,2,3,4, and 6.
The code, and results are below:
%%%%parfor test
%%%%figure out how the hell to use parfor
L_rate = .01;
cd = 1;
Vl0 = rand(1,256);
Hl0 = rand(1,301);
Vlcd = rand(1,257);
Hlcd = rand(1,301);
dwnet = zeros(257,301);
%%%calculation to perform:
%%%dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
for k = 1:20
tic
parfor (n=1:10000,1)
dwnet = zeros(257,301);
v1 = L_rate/sqrt(cd);
v2 = [Vl0,1]';
v3 = rand(size(Hl0))<Hl0;
v4 = Vlcd'*Hlcd;
dwnet = v1*(v2*v3-v4);
end
time1(k,1) = toc;
%toc;
tic
parfor (n=1:10000,1)
dwnet = zeros(257,301);
dwnet = L_rate/sqrt(cd)*([Vl0,1]'*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd);
end
time2(k,1) = toc;
%toc;
end
plot([time1,time2])
mean([time1,time2])
%%%%Time results
forloop_1 foorloop_2
%%%for 14.89sec 13.89sec
%%%parfor,1 9.95sec 10.98sec
%%%parfor,2 9.74sec 10.60sec
%%%parfor,3 9.91sec 10.90sec
%%%parfor,4 9.92sec 10.95sec
%%%parfor,6 9.93sec 10.93sec
%%%Run on Intel Core i7-Q720 (first-gen i-series; 4 cores, 8 logic threads)
So, as I expected, using a regular for-loop it is in fact faster to do everything with one line. This makes sense, as I'm adding time by writing v1 v2 v3 and v4 on top of ultimately doing the same calculation with dwnet.
But what fails to make sense to me is that when I use parfor, suddenly that becomes MORE efficient, by roughly 10%. Also, increasing the core-limit while using parfor actually seems to slightly slow the algorithm down - though that could just be a result of my poor understanding of parfor.
Question: Why is the first for-loop more efficient than the 2nd when using parfor? Feel free to expound upon parfor as much as you care to.
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch). Any advice on how to do that with parfor? I need to get the cumulative sum of dwnets from each iteration to update Vl0 and Hl0, so I'd have to add up the dwnets for each worker, and then add them all up between the workers. But I'm fuzzy on how that's handled, and how parfor handles variables.
Any help answering this question is greatly appreciated.
0 件のコメント
採用された回答
Matt J
2014 年 6 月 12 日
編集済み: Matt J
2014 年 6 月 12 日
The statements
dwnet = zeros(257,301);
are unnecessary overhead and are clouding the issue. I re-implemented your code as below, removing all 3 occurrences of these statements and also handling v1 more efficiently. On my machine, the results of a plain for-loop are not significantly different. I expected no difference.
When running with parfor, I find that loop 1 is indeed about 25% faster than loop 2. I assume it's because parfor pre-parses the contents of the loop, analyzing which variables are temporary and which have other roles, see Classification of Variables. Because of this, I imagine parfor can pre-allocate temporary variables or optimize their use in some other obscure way.
v1 = L_rate/sqrt(cd);
Hlcd=v1*Hlcd;
for k = 1:5
tic
parfor (n=1:10000,2)
%for n=1:1e4
v2 = v1*[Vl0,1].';
v3 = rand(size(Hl0))<Hl0;
v4 = Vlcd'*Hlcd;
dwnet = v2*v3-v4;
end
time1(k,1) = toc;
%toc;
tic
parfor (n=1:10000,2)
%for n=1:1e4
dwnet = (v1*[Vl0,1]')*(rand(size(Hl0))<Hl0) - Vlcd'*Hlcd;
end
time2(k,1) = toc;
%toc;
end
Follow-up Question: Say that Vl0 and Hl0 rely on dwnet, and should be updated with every iteration, and Vlcd changes with each iteration. But I could leave Vl0 and Hl0 and dwnet at the same values for several iterations while changing Vlcd (say, 10-100 iterations per batch).
You would probably have to use a double for-loop
for i=1:N
parfor j=1:M
end
end
where the inner parfor loop loops over the batches over which Vl0 and Hl0 remain constant. It might not be worthwhile, though. Batches that small could probably be vectorized, circumventing a for-loop altogether. At least, your simplified example certain can be.
You also have to factor in communication to/from the labs. A normal parfor loop would have to broadcast its data to/from the client every time it begins and ends, so in your case at the beginning and end of every i-th iteration of the outer for-loop. It may help, however, to use Edric Ellis' Worker Object Wrapper, which can make data persist between parfor-loops.
2 件のコメント
Matt J
2014 年 6 月 13 日
編集済み: Matt J
2014 年 6 月 13 日
A side note: you actually can't implement v1 more efficiently... v4 needs to be scaled by v1.
Notice that I pre-scaled H1cd prior to the loop
Hlcd=v1*Hlcd;
This translates into the appropriate scaling on v4. Even if H1cd were changing inside the loop, it makes sense to pre-scale it there. To put it another way, you have a difference of two vector outer products. a*b'- c*d'. Scaling as
v1*(a*b'- c*d')
will require O(N^2) multiplies with v1 whereas scaling as
(v1*a)*b.' - (v1*c)*d'
will require only O(N).
That link you provided mentions a Reduction type variable within parfor, which is probably what I'm looking for. It would accumulate the sum of the dwnets generated each iteration
I'm very doubtful about whether this is the best approach, though my advice is of course clouded by not seeing your full true computation. In general, though, it is rather inefficient to be adding up outer products, or differences of outer products. Imagine for example that you had matrices V2 and V3 with respective dimensions 257x10000 and 301x10000 whose columns are the individual vectors v2 and v3 that you are generating within the parfor loop. Then the summation of all the v2*v3.' terms is equivalent to the single matrix multiplication V2*V3.', which MATLAB can do very fast, is already internally multi-threaded for you, and doesn't involve allocating memory for a 257x301 matrix 10000 times.
So, a better alternative for you might be to generate the columns V2(:,n) and V3(:,n) with parfor (see "Sliced Variables" in the link I gave you) and then post-multiply after the parfor loop. You would do the same thing with the summation of the outer products v4 and subtract the two results.
Last thing - any reason changing the number of max cores (parfor,2 vs parfor,4) didn't change the processing time significantly?
I haven't had a chance to test on a machine with more than 2 cores. I wonder if that would still be true with the operations
dwnet = zeros(257,301)
removed. Basically, to get the best advantage out of parallel cores, you want to be doing a lot more computation than memory transfer. Concatenation operations like [Vl0,1]' might be expensive. They force MATLAB to allocate memory of apriori unknown size and probably involve some RAM traffic. Since Vl0 is constant throughout the loop, you should probably do that concatenation prior to parfor. I speculate that function calls like zeros() and rand() might also slow things down. I wonder if it wouldn't be better to generate a 257x10000 random array prior to the parfor-loop and then broadcast it as a sliced variable.
I'll run a few tests, but I think I've got a good idea on this now. Thanks for your input.
OK. Make sure you're aware of what this button is meant for
and use if appropriate.
その他の回答 (0 件)
参考
カテゴリ
Help Center および File Exchange で Loops and Conditional Statements についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!