Double summation with vectorized loops

Hi. I want to vectorize this double for loop because it is a bottleneck in my code. Since MATLAB is a based-one indexing language I have to create an additional term for M = 0.
R,r,lambda,phi,mu_p are constants
Slm(L,M),Clm(L,M) are matrices 70x70
Plm(L,M) is a matrix 70x71
Cl(L),Pl(L) are vectors 70x1
% function dU_r
s1 = 0;
for L = 2:70
s1 = s1 + ((R/r)^L)*(L+1)*Pl(L)*Cl(L);
for M = 1:L
s1 = s1 + ((R/r)^L)*(L+1)*Plm(L,M)*(Clm(L,M)*...
cos(M*lambda) + Slm(L,M)*sin(M*lambda));
end
end
dU_r = -(mu_p/(r^2))*s1;
% function dU_phi
s2=0;
for L = 2:70
s2 = s2 + ((R/r)^L))*Plm(L,1)*Cl(L);
for M = 1:L
s2 = s2 + ((R/r)^L)*(Plm(L,M+1)-M*tan(phi)*Plm(L,M))*...
(Clm(L,M)*cos(M*lambda) + Slm(L,M)*sin(M*lambda));
end;
end;
dU_phi = (mu_p/r)*s2;
% function dU_lambda
s3=0;
for L=2:70
for M=1:L
s3 = s3 + ((R/r)^L)*M*Plm(L,M)*(Slm(L,M)*cos(M*lambda)...
- Clm(L,M)*sin(M*lambda));
end;
dU_lambda = (mu_p/r)*s3;

3 Comments

Plm(:, 71) is not used anywhere - is this correct?
@Jan Simon: This term is used inside another for loop. I had not written all code because the remaining for loops are similar but, finally, I have decided to do it. Thank you very much for your help.
@Julian: Now you have expanded your code. Following my example you can calcluate the three sums simultaneously.
The general rule is: Move all repeated claculations out of the loop. E.g. for dU_phi
you calculate TAN(phi) 2484 times, although it is constant.

Sign in to comment.

 Accepted Answer

Let's solve the the inner loop at first (I prefer "j", because the lower case "L" looks like a one):
R = rand; r = rand; lambda = rand;
Slm = rand(70); Clm = rand(70);
Plm = rand(70, 71);
Cl = rand(70, 1); Pl = rand(70, 1);
s = 0;
for j = 2:70
s = s + ((R/r)^j) * (j+1) * Pl(j) * Cl(j) + ...
sum(((R/r)^j) * (j+1) * Plm(L,1:j) .* ...
(Clm(L, 1:j) .* cos((1:j) * lambda) + ...
Slm(L, 1:j) .* sin((1:j) * lambda)));
end
But this is 4 times slower than the original version under Matlab 2009a!
Let's try to avoid the repeated power, COS and SIN:
Rr = R / r;
RrL = RrL; % EDITED: No cumprod anymore -> 5% faster
cosLambda = cos((1:70)* lambda);
sinLambda = sin((1:70)* lambda);
u1 = uint8(1);
s = 0;
for j = uint8(2):uint8(70)
RrL = RrL * Rr;
q = RrL * (double(j) + 1);
t = Pl(j) * Cl(j);
for m = u1:j
t = t + Plm(j,m) * ...
(Clm(j, m) * cosLambda(m) + ...
Slm(j, m) * sinLambda(m));
end
s = s + q * t;
end
EDITED: 40% faster with UINT8 loop indices instead of DOUBLEs! Same speed for INT32, but only 25% for UINT32.
This is 12 times faster than the original version - with FOR loops!
So vectorized does not necessarily mean faster. The JIT acceleration introduced with Matlab 6.5 increases the speed of this loop remarkably. And avoiding powers and trigonometric calculations is important also.
The old tale of the slow FOR loops is very sticky.

11 Comments

@Jan Simon: Thank you very much for your quick answer. I would like to know if there is some way to vectorize your method without using for loop.
Please explain why do you want a vectorized method. The vectorization of the inner loop is 15 times slower than the simplified double FOR loop already (with the SIN/COS/POWER moved outside the loops also), because the creation of the intermediate vectors consumes a lot of time. I'm sure, a full vectorization will be slower again. Beside the creation of the large intermediate array, the consideration of the triangular shape will be inefficient also.
If you want more speed, create a C-Mex file.
+1
I wonder how long it will take this automatic assumption to disappear? It has already been 6 years or so since 6.5, but perhaps by the 10 year mark...
Well, try timing Matt's version. Depending partly on your MATLAB version, you might find that vectorizing is slower for your purposes. Vectorizing usually comes at the cost of more memory overhead, and that can be much more expensive than looping.
@Jan Simon: I am trying to propagate the orbit of a satellite. For 0.1 seconds of propagation, my program takes 17 seconds long (with your solution) to give me the result. I am using the Matlab profiler. It says me that the bottleneck is inside the inner for loop so I think I should vectorize it.
@Walter Roberson: Thank you very much for your suggestion and comment. Cheers.
@Matt: Matlab 6.5 was released in July 2002.
The power of the JIT acceleration is revealed, if the debugger is enabled, e.g. by setting a breakpoint anywhere in the code: Suddenly the vectorized version is the fastest! But still 33 times slower than the simple double FOR loops.
@Julian: Care for disabled debugging during to run the PROFILEr.
@Jan Simon: I do not understand what you want to say. When I run the profiler, the debugger is always disabled (there is no breakpoint in my code). Should I have the debugger disabled then or enabled during the running of the profiler?
@Julian: Everything is ok, if there are no break points (empty output from DBSTATUS).
Be aware, that the profiler slows down the JIT-accelerated loops more than other parts of the code! A workaround is using TIC TOC around the whole program measure twice with and without duplicating the investigated section. This is far from being comfortable, but avoid the disproportional PROFILE overhead.
56819066_381115406071851_6045156277162606592_n.png
how to write second and third equation for image with matlab code plz?
Please start a new Question for this.

Sign in to comment.

More Answers (2)

Here is a vectorized version, but I must say that I would have probably just went with Jan's FOR loop. It seems you need to see a vectorization, even if it is not the most efficient solution. Here you go:
V = ((R/r).^(2:70));
s = repmat(1:70,69,1);
s = sum(V(1:69).*(3:71).*sum(tril(Plm(2:70,1:70).*...
(Clm(2:70,1:70).*cos(s*lambda)+Slm(2:70,1:70).*...
sin(s*lambda)),1),2).'+V.*(3:71).*Pl(2:70).'.*Cl(2:70).');

11 Comments

@Matt Fig: Thank you very much for your answer. I am a newbie with Matlab. I have been researching on internet about profiling, optimization and speed up the Matlab code. The documents that I found adviced to vectorize loops so I thought this way is the best. It looks like I am wrong but, as I mentioned above, the solution without a full vectorization does not work with my program as I would like it to do.
Timings (Matlab 2009a, single core), 100 repetitions:
Orig: 0.136 sec
Matt: 0.162 sec
Jan1: 0.635 sec (inner loop vectorized)
Jan2: 0.022 sec (simplified FOR loops, DOUBLE counters)
Jan3: 0.013 sec (simplified FOR loops, UINT8 counters)
@Julian: I get similar timing results to Jan, which is why I said I would have just gone with the optimized FOR loops...
You are not the first to fall for the myth of the infallible vectorization paradigm ;-) In fact, I went ahead and wrote the above vectorization to use as an example to help convince future victims!
@Jan Simon: Thank you for the time spent measuring timing. It is a good practice that I have learnt and I will use for the next time. Cheers.
@Matt: Therefore I've voted Julian's question: It is really a nice example.
I'm deeply surprised about the 40% speed gain, if UINT8 or INT8 indices are used. And even less expected is the reduction to 25% when using UINT32, while INT32 leads to 40% also. I think integer indices might have big potential in a lot of functions. I'll start some tests, e.g. with the lame INTERP1.
Jan just to get this into my head clear and straight:
Loop with UINT8, INT8 and INT32 -> 40% speed gain,
Loop with UINT32 -> 75% speed gain?
This means having to go through much code to find where it's worth replacing flints to UINT32.
@Bjorn: Nope. I meant "reduction *by* 25%", not "to 25%". Sorry.
"for i=1:j" with DOUBLEs: 100% runtime
"for i=uint8(1):uint8(j)": 60%
"for i=int16(1):int16(j)": 60%
"for i=int32(1):int32(j)": 60%
"for i=uint32(1):uint32(j)": 75% (slower!)
I'd assume, that UINT32 indices are optimal on a 32 bit system. But there could be an implicite conversion to a DOUBLE, which is performed in the processor for INT32, but in software for UINT32.
@Bjorn: Nope. I meant "reduction *by* 25%", not "to 25%". Sorry.
"for i=1:j" with DOUBLEs: 100% runtime
"for i=uint8(1):uint8(j)": 60%
"for i=int16(1):int16(j)": 60%
"for i=int32(1):int32(j)": 60%
"for i=uint32(1):uint32(j)": 75% (slower!)
I'd assume, that UINT32 indices are optimal on a 32 bit system. But there could be an implicite conversion to a DOUBLE, which is performed in the processor for INT32, but in software for UINT32.
I'm not sure if these measurements have any general meaning. Further investigations are needed.
@Jan, I thought we had discussed using int family as indices before? Perhaps in an old NG thread... can't remember. But anyway, yes it is true that using the int family as indices can speed things up oftentimes. The word of caution, which I know you know, but I mention for anyone reading this thread, is to watch out for overflow. For example:
A = zeros(100000,2) % Pre-allocate
for ii = 1:int8(size(A,1)) % Will run faster than expected!!
% Do something with A
end
@Matt: Yes, we have discussed this before. But when I look into to source of Matlab's toolbox functions, e.g. MAT2CELL, it does not look like we have impressed the TMW developers very much - at least until 2009.
I like the way you make Matlab run "faster than expected". The consequent avoiding of work is real efficiency. :-)
The OP has different timings, because his Pl and Plm are sparse. But for the full Pl and >50% full Plm sparsity wastes time by impeding the JIT.
@Jan, you got my hopes up with that 25%! (On the other hand maybe it was just as well that it was "by 25%" - saves me a whole lot of dreary rewriting. But with the speedup we're getting closer to the practices of "normal" programming - having to make sure that we use the optimal type for each variable...)

Sign in to comment.

Hi. Taking like reference the solution Jan3, given by Jan Simon, I have written the following code for my problem. Thank all that have collaborated with comments and suggestions.
Rr = R/r;
RrL = Rr;
cosLambda = cos((1:70)* lambda);
sinLambda = sin((1:70)* lambda);
u1 = uint8(1);
s1 = 0;
s2 = 0;
s3 = 0;
for j = uint8(2):uint8(70)
RrL = RrL * Rr;
q1 = RrL * (double(j) + 1);
t1 = Pl(j) * datos.Cl(j);
q2 = RrL;
t2 = Plm(j,1) * Cl(j);
t3 = 0;
for m = u1:j
t1 = t1 + Plm(j,m) * ...
(Clm(j, m) * cosLambda(m) + ...
Slm(j, m) * sinLambda(m));
t2 = t2 + (Plm(j,m+1)-double(m)*tan_phi*Plm(j,m))*...
(Clm(j,m)*cosLambda(m) + Slm(j,m)*sinLambda(m));
t3 = t3 + double(m)*Plm(j,m)*(Slm(j,m)*cosLambda(m)...
- Clm(j,m)*sinLambda(m));
end
s1 = s1 + q1 * t1;
s2 = s2 + q2 * t2;
s3 = s3 + q2 * t3;
end
dU_r = -(mu_p/(r^2))*s1;
dU_phi = (mu_p/r)*s2;
dU_lambda = (mu_p/r)*s3;

Categories

Find more on Loops and Conditional Statements in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!