Double summation with vectorized loops
Hi. I want to vectorize this double for loop because it is a bottleneck in my code. Since MATLAB uses one-based indexing, I have to create an additional term for M = 0.
R,r,lambda,phi,mu_p are constants
Slm(L,M),Clm(L,M) are matrices 70x70
Plm(L,M) is a matrix 70x71
Cl(L),Pl(L) are vectors 70x1
% function dU_r
s1 = 0;
for L = 2:70
    s1 = s1 + ((R/r)^L)*(L+1)*Pl(L)*Cl(L);
    for M = 1:L
        s1 = s1 + ((R/r)^L)*(L+1)*Plm(L,M)*(Clm(L,M)*...
            cos(M*lambda) + Slm(L,M)*sin(M*lambda));
    end
end
dU_r = -(mu_p/(r^2))*s1;
% function dU_phi
s2 = 0;
for L = 2:70
    s2 = s2 + ((R/r)^L)*Plm(L,1)*Cl(L);
    for M = 1:L
        s2 = s2 + ((R/r)^L)*(Plm(L,M+1)-M*tan(phi)*Plm(L,M))*...
            (Clm(L,M)*cos(M*lambda) + Slm(L,M)*sin(M*lambda));
    end
end
dU_phi = (mu_p/r)*s2;
% function dU_lambda
s3 = 0;
for L = 2:70
    for M = 1:L
        s3 = s3 + ((R/r)^L)*M*Plm(L,M)*(Slm(L,M)*cos(M*lambda)...
            - Clm(L,M)*sin(M*lambda));
    end
end
dU_lambda = (mu_p/r)*s3;
3 Comments
Jan
on 31 May 2011
Plm(:, 71) is not used anywhere - is this correct?
Julián Francisco
on 31 May 2011
Jan
on 31 May 2011
@Julian: Now you have expanded your code. Following my example you can calculate the three sums simultaneously.
The general rule is: move all repeated calculations out of the loop. E.g. for dU_phi
you calculate TAN(phi) 2484 times, although it is constant.
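A minimal sketch of that hoisting, applied to the dU_phi loop from the question (same variable names as above; TAN(phi) is computed once for the whole double loop, and (R/r)^L once per outer iteration instead of once per term):

```matlab
% Hoist loop-invariant quantities out of the loops.
tphi = tan(phi);            % constant over the whole double loop
s2 = 0;
for L = 2:70
    RrL = (R/r)^L;          % constant over the inner loop
    s2 = s2 + RrL*Plm(L,1)*Cl(L);
    for M = 1:L
        s2 = s2 + RrL*(Plm(L,M+1) - M*tphi*Plm(L,M))*...
            (Clm(L,M)*cos(M*lambda) + Slm(L,M)*sin(M*lambda));
    end
end
dU_phi = (mu_p/r)*s2;
```

The same idea applies to cos(M*lambda) and sin(M*lambda), which can be precomputed once for M = 1:70 before the loops start.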
Accepted Answer
More Answers (2)
Matt Fig
on 31 May 2011
Here is a vectorized version, but I must say that I would probably just have gone with Jan's FOR loop. It seems you need to see a vectorization, even if it is not the most efficient solution. Here you go:
V = ((R/r).^(2:70));
s = repmat(1:70,69,1);
s = sum(V(1:69).*(3:71).*sum(tril(Plm(2:70,1:70).*...
    (Clm(2:70,1:70).*cos(s*lambda)+Slm(2:70,1:70).*...
    sin(s*lambda)),1),2).' + V.*(3:71).*Pl(2:70).'.*Cl(2:70).');
11 Comments
Julián Francisco
on 31 May 2011
Jan
on 31 May 2011
Timings (Matlab 2009a, single core), 100 repetitions:
Orig: 0.136 sec
Matt: 0.162 sec
Jan1: 0.635 sec (inner loop vectorized)
Jan2: 0.022 sec (simplified FOR loops, DOUBLE counters)
Jan3: 0.013 sec (simplified FOR loops, UINT8 counters)
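The Jan2/Jan3 variants themselves are not shown in this thread. A hypothetical reconstruction, assuming "simplified FOR loops" means hoisted invariants plus precomputed trig terms, with the counter cast on L and M being the only difference between the DOUBLE and UINT8 timings (note the explicit double() casts: arithmetic mixing a double with a uint8 would otherwise yield a uint8 result):

```matlab
% Sketch only - computes all three sums in one pass.
cosM = cos((1:70)*lambda);
sinM = sin((1:70)*lambda);
tphi = tan(phi);
s1 = 0; s2 = 0; s3 = 0;
for L = uint8(2):uint8(70)          % plain 2:70 would give the "Jan2" timing
    RrL = (R/r)^double(L);
    s1 = s1 + RrL*(double(L)+1)*Pl(L)*Cl(L);
    s2 = s2 + RrL*Plm(L,1)*Cl(L);
    for M = uint8(1):L
        trig = Clm(L,M)*cosM(M) + Slm(L,M)*sinM(M);
        s1 = s1 + RrL*(double(L)+1)*Plm(L,M)*trig;
        s2 = s2 + RrL*(Plm(L,M+1) - double(M)*tphi*Plm(L,M))*trig;
        s3 = s3 + RrL*double(M)*Plm(L,M)*(Slm(L,M)*cosM(M) - Clm(L,M)*sinM(M));
    end
end
dU_r      = -(mu_p/(r^2))*s1;
dU_phi    =  (mu_p/r)*s2;
dU_lambda =  (mu_p/r)*s3;
```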
Matt Fig
on 31 May 2011
@Julian: I get similar timing results to Jan, which is why I said I would have just gone with the optimized FOR loops...
You are not the first to fall for the myth of the infallible vectorization paradigm ;-) In fact, I went ahead and wrote the above vectorization to use as an example to help convince future victims!
Julián Francisco
on 31 May 2011
Jan
on 1 Jun 2011
@Matt: Therefore I've voted for Julian's question: it is really a nice example.
I'm deeply surprised about the 40% speed gain, if UINT8 or INT8 indices are used. And even less expected is the reduction to 25% when using UINT32, while INT32 leads to 40% also. I think integer indices might have big potential in a lot of functions. I'll start some tests, e.g. with the lame INTERP1.
Bjorn Gustavsson
on 1 Jun 2011
Jan just to get this into my head clear and straight:
Loop with UINT8, INT8 and INT32 -> 40% speed gain,
Loop with UINT32 -> 75% speed gain?
This means having to go through much code to find where it's worth replacing flints to UINT32.
Jan
on 1 Jun 2011
@Bjorn: Nope. I meant "reduction *by* 25%", not "to 25%". Sorry.
"for i=1:j" with DOUBLEs: 100% runtime
"for i=uint8(1):uint8(j)": 60%
"for i=int16(1):int16(j)": 60%
"for i=int32(1):int32(j)": 60%
"for i=uint32(1):uint32(j)": 75% (slower!)
I'd assume that UINT32 indices are optimal on a 32 bit system. But there could be an implicit conversion to a DOUBLE, which is performed in the processor for INT32, but in software for UINT32.
I'm not sure if these measurements have any general meaning. Further investigations are needed.
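A quick way to check such counter-type effects on one's own machine is a micro-benchmark along these lines (a sketch; absolute and relative numbers will differ by MATLAB version and CPU):

```matlab
% Compare DOUBLE vs INT32 loop counters with a trivial loop body.
n = int32(1e6);
x = 0; tic; for i = 1:1e6,       x = x + 1; end; tDouble = toc;
x = 0; tic; for i = int32(1):n,  x = x + 1; end; tInt32  = toc;
fprintf('double: %.4f s,  int32: %.4f s\n', tDouble, tInt32);
```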
Matt Fig
on 1 Jun 2011
@Jan, I thought we had discussed using int family as indices before? Perhaps in an old NG thread... can't remember. But anyway, yes it is true that using the int family as indices can speed things up oftentimes. The word of caution, which I know you know, but I mention for anyone reading this thread, is to watch out for overflow. For example:
A = zeros(100000,2);  % Pre-allocate
for ii = 1:int8(size(A,1))  % Will run faster than expected!!
    % Do something with A (int8 saturates at 127, so the body
    % executes only 127 times instead of 100000)
end
Jan
on 1 Jun 2011
@Matt: Yes, we have discussed this before. But when I look into the source of Matlab's toolbox functions, e.g. MAT2CELL, it does not look like we have impressed the TMW developers very much - at least not until 2009.
I like the way you make Matlab run "faster than expected". Consistently avoiding work is real efficiency. :-)
The OP has different timings, because his Pl and Plm are sparse. But for the full Pl and >50% full Plm sparsity wastes time by impeding the JIT.
Bjorn Gustavsson
on 2 Jun 2011
@Jan, you got my hopes up with that 25%! (On the other hand maybe it was just as well that it was "by 25%" - saves me a whole lot of dreary rewriting. But with the speedup we're getting closer to the practices of "normal" programming - having to make sure that we use the optimal type for each variable...)
Julián Francisco
on 1 Jun 2011