Is vectorized code always faster than loops? Any exceptions?

cr on 27 Jul 2011
[EDIT: 20110727 09:35 CDT - reformat - WDR]
I have a critical chunk of code with six nested for-loops. I vectorized the innermost three, and the vectorized version (same configuration for everything else, same computer) takes more than twice as long to run. I ran each version a few times; the results are below. Any help in understanding this behaviour is appreciated. Thanks.
% fem_nought is the file with loops. fem_optimized is the one with the vectorized equivalent of the innermost 3 loops.
>> fem_optimized
Elapsed time is 10.073242 seconds.
>> fem_optimized
Elapsed time is 9.588474 seconds.
>> fem_optimized
Elapsed time is 9.872822 seconds.
>> fem_nought
Elapsed time is 4.047568 seconds.
>> fem_nought
Elapsed time is 3.678311 seconds.
>> fem_nought
Elapsed time is 3.672811 seconds.
Trimmed versions of both codes are below (the declarations of many variables have been removed):
LOOPS version:
for k=1:nel
    for ri=1:8
        for si=1:8
            for mn=1:4
                for nm=1:4
                    for km=1:4
                        r=.5*(a*p(mn)+r1+r2);
                        s=.5*(b*p(nm)+s3+s2);
                        t=.5*(c*p(km)+t1+t5);
                        a1=-.02*s+0.5*r*(1-r^2)+.05*t;
                        a2=-.05*t-.5*s;
                        %...............SHAPE FUNCTION..........................
                        N(1)=((r-r2)/(r1-r2))*((s-s4)/(s1-s4))*((t-t5)/(t1-t5));
                        N(2)=((r-r1)/(r2-r1))*((s-s3)/(s2-s3))*((t-t6)/(t2-t6));
                        N(3)=((r-r4)/(r3-r4))*((s-s2)/(s3-s2))*((t-t7)/(t3-t7));
                        N(4)=((r-r3)/(r4-r3))*((s-s1)/(s4-s1))*((t-t8)/(t4-t8));
                        N(5)=((r-r6)/(r5-r6))*((s-s8)/(s5-s8))*((t-t1)/(t5-t1));
                        N(6)=((r-r5)/(r6-r5))*((s-s7)/(s6-s7))*((t-t2)/(t6-t2));
                        N(7)=((r-r8)/(r7-r8))*((s-s6)/(s7-s6))*((t-t3)/(t7-t3));
                        N(8)=((r-r7)/(r8-r7))*((s-s5)/(s8-s5))*((t-t4)/(t8-t4));
                        Nr(1)=(1/(r1-r2))*((s-s4)/(s1-s4))*((t-t5)/(t1-t5));
                        Nr(2)=(1/(r2-r1))*((s-s3)/(s2-s3))*((t-t6)/(t2-t6));
                        Nr(3)=(1/(r3-r4))*((s-s2)/(s3-s2))*((t-t7)/(t3-t7));
                        Nr(4)=(1/(r4-r3))*((s-s1)/(s4-s1))*((t-t8)/(t4-t8));
                        Nr(5)=(1/(r5-r6))*((s-s8)/(s5-s8))*((t-t1)/(t5-t1));
                        Nr(6)=(1/(r6-r5))*((s-s7)/(s6-s7))*((t-t2)/(t6-t2));
                        Nr(7)=(1/(r7-r8))*((s-s6)/(s7-s6))*((t-t3)/(t7-t3));
                        Nr(8)=(1/(r8-r7))*((s-s5)/(s8-s5))*((t-t4)/(t8-t4));
                        Ns(1)=((r-r2)/(r1-r2))*(1/(s1-s4))*((t-t5)/(t1-t5));
                        Ns(2)=((r-r1)/(r2-r1))*(1/(s2-s3))*((t-t6)/(t2-t6));
                        Ns(3)=((r-r4)/(r3-r4))*(1/(s3-s2))*((t-t7)/(t3-t7));
                        Ns(4)=((r-r3)/(r4-r3))*(1/(s4-s1))*((t-t8)/(t4-t8));
                        Ns(5)=((r-r6)/(r5-r6))*(1/(s5-s8))*((t-t1)/(t5-t1));
                        Ns(6)=((r-r5)/(r6-r5))*(1/(s6-s7))*((t-t2)/(t6-t2));
                        Ns(7)=((r-r8)/(r7-r8))*(1/(s7-s6))*((t-t3)/(t7-t3));
                        Ns(8)=((r-r7)/(r8-r7))*(1/(s8-s5))*((t-t4)/(t8-t4));
                        Nt(1)=((r-r2)/(r1-r2))*((s-s4)/(s1-s4))*(1/(t1-t5));
                        Nt(2)=((r-r1)/(r2-r1))*((s-s3)/(s2-s3))*(1/(t2-t6));
                        Nt(3)=((r-r4)/(r3-r4))*((s-s2)/(s3-s2))*(1/(t3-t7));
                        Nt(4)=((r-r3)/(r4-r3))*((s-s1)/(s4-s1))*(1/(t4-t8));
                        Nt(5)=((r-r6)/(r5-r6))*((s-s8)/(s5-s8))*(1/(t5-t1));
                        Nt(6)=((r-r5)/(r6-r5))*((s-s7)/(s6-s7))*(1/(t6-t2));
                        Nt(7)=((r-r8)/(r7-r8))*((s-s6)/(s7-s6))*(1/(t7-t3));
                        Nt(8)=((r-r7)/(r8-r7))*((s-s5)/(s8-s5))*(1/(t8-t4));
                        p1(ri,si,k)=a1*N(ri)*Ns(si)*w(mn)*w(nm)*w(km)*.125*a*b*c;
                        p2(ri,si,k)=a2*N(ri)*Nt(si)*w(mn)*w(nm)*w(km)*.125*a*b*c;
                        %Elemental Stiffness Matrix......................
                        ke(ri,si,k) = ke(ri,si,k) + p1(ri,si,k) + p2(ri,si,k);
                    end
                end
            end
        end
    end
end
VECTORIZED version:
for k=1:nel
    r=.5*(a*p(mn)+r1+r2);
    s=.5*(b*p(nm)+s3+s2);
    t=.5*(c*p(km)+t1+t5);
    Nr = zeros(4,4,4,8);
    N = zeros(4,4,4,8);
    Ns = zeros(4,4,4,8);
    Nt = zeros(4,4,4,8);
    for ri=1:8
        for si=1:8
            %...............SHAPE FUNCTION..........................
            Nr(:,:,:,1)=(1/(r1-r2))*((s-s4)/(s1-s4)).*((t-t5)/(t1-t5));
            Nr(:,:,:,2)=(1/(r2-r1))*((s-s3)/(s2-s3)).*((t-t6)/(t2-t6));
            Nr(:,:,:,3)=(1/(r3-r4))*((s-s2)/(s3-s2)).*((t-t7)/(t3-t7));
            Nr(:,:,:,4)=(1/(r4-r3))*((s-s1)/(s4-s1)).*((t-t8)/(t4-t8));
            Nr(:,:,:,5)=(1/(r5-r6))*((s-s8)/(s5-s8)).*((t-t1)/(t5-t1));
            Nr(:,:,:,6)=(1/(r6-r5))*((s-s7)/(s6-s7)).*((t-t2)/(t6-t2));
            Nr(:,:,:,7)=(1/(r7-r8))*((s-s6)/(s7-s6)).*((t-t3)/(t7-t3));
            Nr(:,:,:,8)=(1/(r8-r7))*((s-s5)/(s8-s5)).*((t-t4)/(t8-t4));
            N(:,:,:,1) = (r-r2).*Nr(:,:,:,1);
            N(:,:,:,2) = (r-r1).*Nr(:,:,:,2);
            N(:,:,:,3) = (r-r4).*Nr(:,:,:,3);
            N(:,:,:,4) = (r-r3).*Nr(:,:,:,4);
            N(:,:,:,5) = (r-r6).*Nr(:,:,:,5);
            N(:,:,:,6) = (r-r5).*Nr(:,:,:,6);
            N(:,:,:,7) = (r-r8).*Nr(:,:,:,7);
            N(:,:,:,8) = (r-r7).*Nr(:,:,:,8);
            Ns(:,:,:,1) = N(:,:,:,1)./(s-s4);
            Ns(:,:,:,2) = N(:,:,:,2)./(s-s3);
            Ns(:,:,:,3) = N(:,:,:,3)./(s-s2);
            Ns(:,:,:,4) = N(:,:,:,4)./(s-s1);
            Ns(:,:,:,5) = N(:,:,:,5)./(s-s8);
            Ns(:,:,:,6) = N(:,:,:,6)./(s-s7);
            Ns(:,:,:,7) = N(:,:,:,7)./(s-s6);
            Ns(:,:,:,8) = N(:,:,:,8)./(s-s5);
            Nt(:,:,:,1) = N(:,:,:,1)./(t-t5);
            Nt(:,:,:,2) = N(:,:,:,2)./(t-t6);
            Nt(:,:,:,3) = N(:,:,:,3)./(t-t7);
            Nt(:,:,:,4) = N(:,:,:,4)./(t-t8);
            Nt(:,:,:,5) = N(:,:,:,5)./(t-t1);
            Nt(:,:,:,6) = N(:,:,:,6)./(t-t2);
            Nt(:,:,:,7) = N(:,:,:,7)./(t-t3);
            Nt(:,:,:,8) = N(:,:,:,8)./(t-t4);
            kem = .125*a*b*c * N(:,:,:,ri).*w(mn).*w(nm).*w(km) ...
                .* ( (-.02*s+0.5*r.*(1-r.^2)+.05*t).*Ns(:,:,:,si) ...
                + (-.05*t-.5*s).*Nt(:,:,:,si));
            ke(ri,si,k) = sum(kem(:));
        end
    end
end

Accepted Answer

Jan on 27 Jul 2011
No, vectorized code is not always faster. If the vectorization requires the creation of large temporary arrays, loops are often faster. Allocating memory is very expensive, because it can trigger a garbage collection or even disk swapping.
BTW: Nr, N, Ns and Nt are completely overwritten in each iteration, so it is sufficient and more efficient to allocate them once before the loops.
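The point about work repeated inside the loops can be illustrated with a small, self-contained sketch. It is not the original FEM code: the sample points `p`, the constants `c` and the expression filling `N` are hypothetical stand-ins, chosen only so that `N` has the same 4x4x4x8 shape and, as in the question, does not depend on `ri` or `si`.

```matlab
% Illustrative stand-ins (assumptions, not taken from the original post).
p = linspace(-0.9, 0.9, 4);     % sample points along each axis
[r, s, t] = ndgrid(p, p, p);    % 4x4x4 grids, one entry per point
c = 1:8;                        % one constant per "shape function"

% Version A: N is rebuilt in every (ri, si) iteration, 64 times in total.
keA = zeros(8, 8);
for ri = 1:8
    for si = 1:8
        N = zeros(4, 4, 4, 8);
        for j = 1:8
            N(:, :, :, j) = (r - c(j)) .* (s + c(j)) .* t;
        end
        kem = N(:, :, :, ri) .* N(:, :, :, si);
        keA(ri, si) = sum(kem(:));
    end
end

% Version B: N does not depend on ri or si, so it is built once up front.
N = zeros(4, 4, 4, 8);
for j = 1:8
    N(:, :, :, j) = (r - c(j)) .* (s + c(j)) .* t;
end
keB = zeros(8, 8);
for ri = 1:8
    for si = 1:8
        kem = N(:, :, :, ri) .* N(:, :, :, si);
        keB(ri, si) = sum(kem(:));
    end
end

% Same arithmetic in the same order, so the results match exactly.
assert(isequal(keA, keB));
```

Version B does the array construction once instead of 64 times; how much time that saves in the real code depends on the machine and the MATLAB release.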
1 Comment
cr on 27 Jul 2011
Thanks for your BTW comment. I overlooked that N* was unnecessarily inside the inner loops.

More Answers (2)

Daniel Shub on 27 Jul 2011
I am not sure if vectorization is always faster, but loops are not as expensive as they used to be, thanks to the JIT accelerator. I would guess there might be examples where loops are faster, but I cannot think of one off the top of my head.
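One way to check this on a particular machine and release is to time a loop and its vectorized equivalent on the same data. The sketch below is a generic example (a sum of squares), not the code from the question; the variable names are illustrative, and which version wins is not predictable in general.

```matlab
x = rand(1, 1e6);               % test data shared by both versions

% Loop version: accumulates in a scalar, no temporary array.
tic;
totalLoop = 0;
for i = 1:numel(x)
    totalLoop = totalLoop + x(i)^2;
end
tLoop = toc;

% Vectorized version: x.^2 creates a temporary array the size of x.
tic;
totalVec = sum(x.^2);
tVec = toc;

fprintf('loop:       %.4f s\n', tLoop);
fprintf('vectorized: %.4f s\n', tVec);

% The two sums agree up to floating-point rounding.
assert(abs(totalLoop - totalVec) <= 1e-9 * abs(totalVec));
```

A single tic/toc pair is noisy, so it is worth running the comparison several times; in releases that provide `timeit`, that function gives more robust measurements for code wrapped in a function handle.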
2 Comments
cr on 27 Jul 2011
Can you please shed some light on the JIT and since when it has existed?
Daniel Shub on 27 Jul 2011
I am not the best person to answer that. I would suggest asking it as a new question to get a good answer.

cr on 27 Jul 2011
See my comment on the accepted answer by Jan Simon. The code I pasted above ran on four machines: three PCs (R2010a and R2007b) and a Mac (R2010a). Two PCs (one R2010a and one R2007b) and the Mac took longer with the vectorized code (9 s vs 5 s). One PC (R2007b), strangely, consistently took 5 s for the vectorized code and 29 s for the loops. I'm at my wits' end trying to interpret this now.
With the correction as in the comment mentioned above, the code takes just 1s.
