Is vectorized code always faster than loops? Any exceptions?

Question

0 votes

[EDIT: 20110727 09:35 CDT - reformat - WDR]

I have a critical chunk of code with six nested for-loops. I replaced the innermost three with vectorized code, and the vectorized version (with exactly the same configuration of everything else, on the same computer) takes more than twice as long to run. I ran each of them a few times and the results are below. Any light on this behaviour is appreciated. Thanks.

% fem_nought is the file with loops. fem_optimized is the one with the vectorized equivalent of the innermost 3 loops.
>> fem_optimized
Elapsed time is 10.073242 seconds.
>> fem_optimized
Elapsed time is 9.588474 seconds.
>> fem_optimized
Elapsed time is 9.872822 seconds.
>> fem_nought
Elapsed time is 4.047568 seconds.
>> fem_nought
Elapsed time is 3.678311 seconds.
>> fem_nought
Elapsed time is 3.672811 seconds.

Trimmed versions of both codes are below (the declarations of many variables have been removed):

LOOPS version:

for k=1:nel
    for ri=1:8
        for si=1:8
            for mn=1:4
                for nm=1:4
                    for km=1:4
                        r=.5*(a*p(mn)+r1+r2);
                        s=.5*(b*p(nm)+s3+s2);
                        t=.5*(c*p(km)+t1+t5);
                        a1=-.02*s+0.5*r*(1-r^2)+.05*t;
                        a2=-.05*t-.5*s;
                        %...............SHAPE FUNCTION..........................
                        N(1)=((r-r2)/(r1-r2))*((s-s4)/(s1-s4))*((t-t5)/(t1-t5));
                        N(2)=((r-r1)/(r2-r1))*((s-s3)/(s2-s3))*((t-t6)/(t2-t6));
                        N(3)=((r-r4)/(r3-r4))*((s-s2)/(s3-s2))*((t-t7)/(t3-t7));
                        N(4)=((r-r3)/(r4-r3))*((s-s1)/(s4-s1))*((t-t8)/(t4-t8));
                        N(5)=((r-r6)/(r5-r6))*((s-s8)/(s5-s8))*((t-t1)/(t5-t1));
                        N(6)=((r-r5)/(r6-r5))*((s-s7)/(s6-s7))*((t-t2)/(t6-t2));
                        N(7)=((r-r8)/(r7-r8))*((s-s6)/(s7-s6))*((t-t3)/(t7-t3));
                        N(8)=((r-r7)/(r8-r7))*((s-s5)/(s8-s5))*((t-t4)/(t8-t4));
                        Nr(1)=(1/(r1-r2))*((s-s4)/(s1-s4))*((t-t5)/(t1-t5));
                        Nr(2)=(1/(r2-r1))*((s-s3)/(s2-s3))*((t-t6)/(t2-t6));
                        Nr(3)=(1/(r3-r4))*((s-s2)/(s3-s2))*((t-t7)/(t3-t7));
                        Nr(4)=(1/(r4-r3))*((s-s1)/(s4-s1))*((t-t8)/(t4-t8));
                        Nr(5)=(1/(r5-r6))*((s-s8)/(s5-s8))*((t-t1)/(t5-t1));
                        Nr(6)=(1/(r6-r5))*((s-s7)/(s6-s7))*((t-t2)/(t6-t2));
                        Nr(7)=(1/(r7-r8))*((s-s6)/(s7-s6))*((t-t3)/(t7-t3));
                        Nr(8)=(1/(r8-r7))*((s-s5)/(s8-s5))*((t-t4)/(t8-t4));
                        Ns(1)=((r-r2)/(r1-r2))*(1/(s1-s4))*((t-t5)/(t1-t5));
                        Ns(2)=((r-r1)/(r2-r1))*(1/(s2-s3))*((t-t6)/(t2-t6));
                        Ns(3)=((r-r4)/(r3-r4))*(1/(s3-s2))*((t-t7)/(t3-t7));
                        Ns(4)=((r-r3)/(r4-r3))*(1/(s4-s1))*((t-t8)/(t4-t8));
                        Ns(5)=((r-r6)/(r5-r6))*(1/(s5-s8))*((t-t1)/(t5-t1));
                        Ns(6)=((r-r5)/(r6-r5))*(1/(s6-s7))*((t-t2)/(t6-t2));
                        Ns(7)=((r-r8)/(r7-r8))*(1/(s7-s6))*((t-t3)/(t7-t3));
                        Ns(8)=((r-r7)/(r8-r7))*(1/(s8-s5))*((t-t4)/(t8-t4));
                        Nt(1)=((r-r2)/(r1-r2))*((s-s4)/(s1-s4))*(1/(t1-t5));
                        Nt(2)=((r-r1)/(r2-r1))*((s-s3)/(s2-s3))*(1/(t2-t6));
                        Nt(3)=((r-r4)/(r3-r4))*((s-s2)/(s3-s2))*(1/(t3-t7));
                        Nt(4)=((r-r3)/(r4-r3))*((s-s1)/(s4-s1))*(1/(t4-t8));
                        Nt(5)=((r-r6)/(r5-r6))*((s-s8)/(s5-s8))*(1/(t5-t1));
                        Nt(6)=((r-r5)/(r6-r5))*((s-s7)/(s6-s7))*(1/(t6-t2));
                        Nt(7)=((r-r8)/(r7-r8))*((s-s6)/(s7-s6))*(1/(t7-t3));
                        Nt(8)=((r-r7)/(r8-r7))*((s-s5)/(s8-s5))*(1/(t8-t4));
                        p1(ri,si,k)=a1*N(ri)*Ns(si)*w(mn)*w(nm)*w(km)*.125*a*b*c;
                        p2(ri,si,k)=a2*N(ri)*Nt(si)*w(mn)*w(nm)*w(km)*.125*a*b*c;
                        %Elemental Stiffness Matrix......................
                        ke(ri,si,k) = ke(ri,si,k) + p1(ri,si,k) + p2(ri,si,k);
                    end
                end
            end
        end
    end
end

VECTORIZED version:

for k=1:nel
      r=.5*(a*p(mn)+r1+r2);
      s=.5*(b*p(nm)+s3+s2);
      t=.5*(c*p(km)+t1+t5);
      Nr = zeros(4,4,4,8);
      N = zeros(4,4,4,8);
      Ns = zeros(4,4,4,8);
      Nt = zeros(4,4,4,8);
      for ri=1:8
          for si=1:8
              %...............SHAPE FUNCTION..........................
              Nr(:,:,:,1)=(1/(r1-r2))*((s-s4)/(s1-s4)).*((t-t5)/(t1-t5));
              Nr(:,:,:,2)=(1/(r2-r1))*((s-s3)/(s2-s3)).*((t-t6)/(t2-t6));
              Nr(:,:,:,3)=(1/(r3-r4))*((s-s2)/(s3-s2)).*((t-t7)/(t3-t7));
              Nr(:,:,:,4)=(1/(r4-r3))*((s-s1)/(s4-s1)).*((t-t8)/(t4-t8));
              Nr(:,:,:,5)=(1/(r5-r6))*((s-s8)/(s5-s8)).*((t-t1)/(t5-t1));
              Nr(:,:,:,6)=(1/(r6-r5))*((s-s7)/(s6-s7)).*((t-t2)/(t6-t2));
              Nr(:,:,:,7)=(1/(r7-r8))*((s-s6)/(s7-s6)).*((t-t3)/(t7-t3));
              Nr(:,:,:,8)=(1/(r8-r7))*((s-s5)/(s8-s5)).*((t-t4)/(t8-t4));
              N(:,:,:,1) = (r-r2).*Nr(:,:,:,1);
              N(:,:,:,2) = (r-r1).*Nr(:,:,:,2);
              N(:,:,:,3) = (r-r4).*Nr(:,:,:,3);
              N(:,:,:,4) = (r-r3).*Nr(:,:,:,4);
              N(:,:,:,5) = (r-r6).*Nr(:,:,:,5);
              N(:,:,:,6) = (r-r5).*Nr(:,:,:,6);
              N(:,:,:,7) = (r-r8).*Nr(:,:,:,7);
              N(:,:,:,8) = (r-r7).*Nr(:,:,:,8);
              Ns(:,:,:,1) = N(:,:,:,1)./(s-s4);
              Ns(:,:,:,2) = N(:,:,:,2)./(s-s3);
              Ns(:,:,:,3) = N(:,:,:,3)./(s-s2);
              Ns(:,:,:,4) = N(:,:,:,4)./(s-s1);
              Ns(:,:,:,5) = N(:,:,:,5)./(s-s8);
              Ns(:,:,:,6) = N(:,:,:,6)./(s-s7);
              Ns(:,:,:,7) = N(:,:,:,7)./(s-s6);
              Ns(:,:,:,8) = N(:,:,:,8)./(s-s5);
              Nt(:,:,:,1) = N(:,:,:,1)./(t-t5);
              Nt(:,:,:,2) = N(:,:,:,2)./(t-t6);
              Nt(:,:,:,3) = N(:,:,:,3)./(t-t7);
              Nt(:,:,:,4) = N(:,:,:,4)./(t-t8);
              Nt(:,:,:,5) = N(:,:,:,5)./(t-t1);
              Nt(:,:,:,6) = N(:,:,:,6)./(t-t2);
              Nt(:,:,:,7) = N(:,:,:,7)./(t-t3);
              Nt(:,:,:,8) = N(:,:,:,8)./(t-t4);
              kem = .125*a*b*c * N(:,:,:,ri).*w(mn).*w(nm).*w(km) ...
                  .* (  (-.02*s+0.5*r.*(1-r.^2)+.05*t).*Ns(:,:,:,si) ...
                  +  (-.05*t-.5*s).*Nt(:,:,:,si));
              ke(ri,si,k) = sum(kem(:));
          end
      end
end

Answer 1

Jan on 27 Jul 2011

1 vote

No, vectorized code is not always faster. If the vectorization requires creating large temporary arrays, loops are often faster. Allocating memory is very expensive, because it can trigger garbage collection or even disk swapping.

An example: http://www.mathworks.com/matlabcentral/answers/8461-double-summation-with-vectorized-loops

BTW: Nr, N, Ns and Nt are completely overwritten in each iteration, so it is enough, and more efficient, to allocate them once before the loops.
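
A minimal sketch of that idea, going slightly beyond allocation: in the vectorized code from the question, the shape-function arrays depend on neither ri nor si, so they could presumably be filled once and only the final product left inside the two loops. The grid, the offsets in c, the weights W and the simplified integrand below are made-up placeholders, not the values or formulas from the question.

```matlab
% Hypothetical stand-in data (not the values from the question).
[R, S, T] = ndgrid(linspace(-1, 1, 4));   % 4x4x4 grid of sample points
c = linspace(2, 3, 8);                    % 8 made-up node offsets
W = ones(4, 4, 4);                        % made-up quadrature weights

% Variant 1: N is allocated and filled inside the ri/si loops (64 times).
ke1 = zeros(8, 8);
tic
for ri = 1:8
    for si = 1:8
        N = zeros(4, 4, 4, 8);
        for n = 1:8
            N(:, :, :, n) = (R - c(n)) .* (S - c(n)) .* (T - c(n));
        end
        kem = W .* N(:, :, :, ri) .* N(:, :, :, si);
        ke1(ri, si) = sum(kem(:));
    end
end
t1 = toc;

% Variant 2: N does not depend on ri or si, so it is built once up front.
ke2 = zeros(8, 8);
tic
N = zeros(4, 4, 4, 8);
for n = 1:8
    N(:, :, :, n) = (R - c(n)) .* (S - c(n)) .* (T - c(n));
end
for ri = 1:8
    for si = 1:8
        kem = W .* N(:, :, :, ri) .* N(:, :, :, si);
        ke2(ri, si) = sum(kem(:));
    end
end
t2 = toc;

fprintf('rebuilt: %.4f s, hoisted: %.4f s\n', t1, t2);
maxDiff = max(abs(ke1(:) - ke2(:)));   % largest difference between the two results
```

Both variants perform the same arithmetic on the same inputs, so maxDiff should be zero; only the amount of repeated work differs. The actual timings will depend on the machine and the MATLAB release.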

cr on 27 Jul 2011

Thanks for your BTW comment. I overlooked that N* was unnecessarily inside the inner loops.

Answer 2

Daniel Shub on 27 Jul 2011

0 votes

I am not sure if vectorization is always faster, but loops are not as expensive as they used to be, thanks to the JIT accelerator. I would guess there might be examples where loops are faster, but I cannot think of one off the top of my head.
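
One way to look for such a case is simply to time both forms. The sketch below is an illustrative candidate, not a guaranteed counterexample: it computes a double sum once with plain scalar loops and once with a vectorized expression that has to build an n-by-n temporary array. The problem size n and the random data are assumptions chosen for illustration, and which version wins is likely to vary with the machine and the MATLAB release.

```matlab
% Double sum of x(i)*y(j) over all pairs: loop version vs. vectorized version.
n = 1000;                 % made-up problem size
x = rand(n, 1);
y = rand(n, 1);

% Loop version: works on scalars only, no temporary arrays.
tic
sLoop = 0;
for i = 1:n
    for j = 1:n
        sLoop = sLoop + x(i) * y(j);
    end
end
tLoop = toc;

% Vectorized version: x * y' creates an n-by-n temporary (8 MB for n = 1000).
tic
P = x * y';
sVec = sum(P(:));
tVec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
relErr = abs(sLoop - sVec) / abs(sVec);   % relative difference between the two sums
```

The two results should agree up to floating-point rounding, since the terms are only added in a different order. Increasing n makes the temporary array grow quadratically, which is where memory allocation starts to matter.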

cr on 27 Jul 2011

Can you please shed some light on the JIT and since when it has existed?

Daniel Shub on 27 Jul 2011

I am not the best person to answer that. I would suggest asking it as a new question to get a good answer.

Answer 3

cr on 27 Jul 2011

0 votes

See my comment on the accepted answer by Jan Simon. The code I pasted above ran on 4 machines: 3 PCs (R2010a & R2007b) and a Mac (R2010a). Two PCs (one R2010a & one R2007b) and the Mac took longer with the vectorized code (9 s vs 5 s). One PC (R2007b), strangely, consistently took 5 s for the vectorized code and 29 s for the loops. I'm at my wits' end trying to interpret this now.

With the correction as in the comment mentioned above, the code takes just 1s.
