# Vectorized code is more time-consuming than a simple for loop

hosein bashi on 18 Sep 2022
Edited: John D'Errico on 18 Sep 2022
Hi.
I have a simple for loop, and I replaced it with a vectorized computation hoping to get a faster function, but it's actually slower! Any idea why?
CODE:
```matlab
p = 8.7390;
T = 791.6200;
a = tic;
for i = 1:100000
    h2_pT(p, T);
end
toc(a)
b = tic;
for i = 1:100000
    h2_pTnew(p, T);
end
toc(b)

function h2_pT = h2_pT(p, T)
Ir = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 10, 10, 10, 16, 16, 18, 20, 20, 20, 21, 22, 23, 24, 24, 24];
Jr = [0, 1, 2, 3, 6, 1, 2, 4, 7, 36, 0, 1, 3, 6, 35, 1, 2, 3, 7, 3, 16, 35, 0, 11, 25, 8, 36, 13, 4, 10, 14, 29, 50, 57, 20, 35, 48, 21, 53, 39, 26, 40, 58];
nr = [-1.7731742473213E-03, -0.017834862292358, -0.045996013696365, -0.057581259083432, -0.05032527872793, -3.3032641670203E-05, -1.8948987516315E-04, -3.9392777243355E-03, -0.043797295650573, -2.6674547914087E-05, 2.0481737692309E-08, 4.3870667284435E-07, -3.227767723857E-05, -1.5033924542148E-03, -0.040668253562649, -7.8847309559367E-10, 1.2790717852285E-08, 4.8225372718507E-07, 2.2922076337661E-06, -1.6714766451061E-11, -2.1171472321355E-03, -23.895741934104, -5.905956432427E-18, -1.2621808899101E-06, -0.038946842435739, 1.1256211360459E-11, -8.2311340897998, 1.9809712802088E-08, 1.0406965210174E-19, -1.0234747095929E-13, -1.0018179379511E-09, -8.0882908646985E-11, 0.10693031879409, -0.33662250574171, 8.9185845355421E-25, 3.0629316876232E-13, -4.2002467698208E-06, -5.9056029685639E-26, 3.7826947613457E-06, -1.2768608934681E-15, 7.3087610595061E-29, 5.5414715350778E-17, -9.436970724121E-07];
R = 0.461526; % kJ/(kg K)
Pi = p;
tau = 540 / T;
gr_tau = 0;
for i = 1:43
    gr_tau = gr_tau + nr(i) * Pi^Ir(i) * Jr(i) * (tau - 0.5)^(Jr(i) - 1);
end
h2_pT = R * T * tau * gr_tau;
end

function h2_pT = h2_pTnew(p, T)
Ir = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 10, 10, 10, 16, 16, 18, 20, 20, 20, 21, 22, 23, 24, 24, 24];
Jr = [0, 1, 2, 3, 6, 1, 2, 4, 7, 36, 0, 1, 3, 6, 35, 1, 2, 3, 7, 3, 16, 35, 0, 11, 25, 8, 36, 13, 4, 10, 14, 29, 50, 57, 20, 35, 48, 21, 53, 39, 26, 40, 58];
nr = [-1.7731742473213E-03, -0.017834862292358, -0.045996013696365, -0.057581259083432, -0.05032527872793, -3.3032641670203E-05, -1.8948987516315E-04, -3.9392777243355E-03, -0.043797295650573, -2.6674547914087E-05, 2.0481737692309E-08, 4.3870667284435E-07, -3.227767723857E-05, -1.5033924542148E-03, -0.040668253562649, -7.8847309559367E-10, 1.2790717852285E-08, 4.8225372718507E-07, 2.2922076337661E-06, -1.6714766451061E-11, -2.1171472321355E-03, -23.895741934104, -5.905956432427E-18, -1.2621808899101E-06, -0.038946842435739, 1.1256211360459E-11, -8.2311340897998, 1.9809712802088E-08, 1.0406965210174E-19, -1.0234747095929E-13, -1.0018179379511E-09, -8.0882908646985E-11, 0.10693031879409, -0.33662250574171, 8.9185845355421E-25, 3.0629316876232E-13, -4.2002467698208E-06, -5.9056029685639E-26, 3.7826947613457E-06, -1.2768608934681E-15, 7.3087610595061E-29, 5.5414715350778E-17, -9.436970724121E-07];
R = 0.461526; % kJ/(kg K)
Pi = p;
tau = 540 / T;
gr_tau = sum(nr .* Pi.^Ir .* Jr .* (tau - 0.5).^(Jr - 1));
h2_pT = R * T * tau * gr_tau;
end
```
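A side note on the benchmark itself (my sketch, not part of the question): both functions rebuild the three 43-element constant tables on every one of the 100000 calls, so the table setup can dominate either timing. A hypothetical variant, here named `h2_pT_cached`, uses `persistent` variables so the tables are built only once:

```matlab
function h2_pT = h2_pT_cached(p, T)
% Same computation as h2_pTnew, but the constant tables Ir, Jr, nr are
% built on the first call only and reused afterwards.
persistent Ir Jr nr
if isempty(Ir)
    Ir = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 6, 6, 7, 7, 7, 8, 8, 9, 10, 10, 10, 16, 16, 18, 20, 20, 20, 21, 22, 23, 24, 24, 24];
    Jr = [0, 1, 2, 3, 6, 1, 2, 4, 7, 36, 0, 1, 3, 6, 35, 1, 2, 3, 7, 3, 16, 35, 0, 11, 25, 8, 36, 13, 4, 10, 14, 29, 50, 57, 20, 35, 48, 21, 53, 39, 26, 40, 58];
    nr = [-1.7731742473213E-03, -0.017834862292358, -0.045996013696365, -0.057581259083432, -0.05032527872793, -3.3032641670203E-05, -1.8948987516315E-04, -3.9392777243355E-03, -0.043797295650573, -2.6674547914087E-05, 2.0481737692309E-08, 4.3870667284435E-07, -3.227767723857E-05, -1.5033924542148E-03, -0.040668253562649, -7.8847309559367E-10, 1.2790717852285E-08, 4.8225372718507E-07, 2.2922076337661E-06, -1.6714766451061E-11, -2.1171472321355E-03, -23.895741934104, -5.905956432427E-18, -1.2621808899101E-06, -0.038946842435739, 1.1256211360459E-11, -8.2311340897998, 1.9809712802088E-08, 1.0406965210174E-19, -1.0234747095929E-13, -1.0018179379511E-09, -8.0882908646985E-11, 0.10693031879409, -0.33662250574171, 8.9185845355421E-25, 3.0629316876232E-13, -4.2002467698208E-06, -5.9056029685639E-26, 3.7826947613457E-06, -1.2768608934681E-15, 7.3087610595061E-29, 5.5414715350778E-17, -9.436970724121E-07];
end
R = 0.461526; % kJ/(kg K)
tau = 540 / T;
gr_tau = sum(nr .* p.^Ir .* Jr .* (tau - 0.5).^(Jr - 1));
h2_pT = R * T * tau * gr_tau;
end
```

With the setup cost removed, any remaining timing difference reflects the loop vs. vectorized summation itself.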

John D'Errico on 18 Sep 2022
Edited: John D'Errico on 18 Sep 2022
It is not uncommon to have vectorized code be slower than a brute force loop. Why?
The word "vectorization" can refer to many different coding paradigms. As a basic example, consider finding the sum of the natural numbers 1 through 100.
```matlab
n = 1:100;
sum(n)
% ans = 5050

N = 100;
S = 0;
for ind = 1:N
    S = S + ind;
end
S
% S = 5050
```
Of course, both schemes produce the same end result. One of them generates the set of numbers internally, putting them into memory. Then an internal loop is generated to form the sum. I call that an implicit loop, since you never see the loop itself, but it is there. In the latter case, we have an explicit or external loop.
Will the vectorized version be faster? Usually, yes. But suppose N were 1e9, or some number large enough that just populating that large a block of internal memory with the numbers is itself highly CPU intensive. You may even force MATLAB to use virtual memory, swapping data in and out. So there is a point where the explicit loop may be better. (I recall recently running into exactly that case in some response I wrote, but I write a LOT of responses...)
Of course, there are many other variations of code, vectorized and unvectorized, that will gain or not, depending on the problem size. Sometimes it is the dimensionality of the problem that costs you.
For example, suppose I want to find all sets of 6 integers that sum to 50. I could write it like this:
```matlab
tic
Sets = zeros(0,6);
for n1 = 1:50
    for n2 = n1:50
        for n3 = n2:50
            for n4 = n3:50
                for n5 = n4:50
                    for n6 = n5:50
                        if (n1 + n2 + n3 + n4 + n5 + n6) == 50
                            Sets(end+1,:) = [n1, n2, n3, n4, n5, n6];
                        end
                    end
                end
            end
        end
    end
end
toc
% Elapsed time is 0.096587 seconds.
size(Sets,1)
% ans = 5427
```
So we have nested for loops, 6 levels deep. UGH. That actually ran pretty quickly, though if I made the problem a little larger, my computer would grind to a halt. But it would terminate eventually. Totally unvectorized, though.
Or, I could use ndgrid to generate ALL possible combinations of the numbers 1:45. Then isolate the ways they will add up to 50. Something like this:
```matlab
tic
elements = int16(1:45);
[N1,N2,N3,N4,N5,N6] = ndgrid(elements,elements,elements,elements,elements,elements);
ind = ((N1 + N2 + N3 + N4 + N5 + N6) == 50) & (N1 <= N2) & (N2 <= N3) & (N3 <= N4) & (N4 <= N5) & (N5 <= N6);
Sets = [N1(ind),N2(ind),N3(ind),N4(ind),N5(ind),N6(ind)];
toc
% Elapsed time is 175.219414 seconds.
size(Sets,1)
% ans = 5427
```
Each array will contain 45^6, approximately 8e9, elements, each of which would by default be a double. (To be a little smart, I did this using int16 arrays instead.) Now find all combinations of those arrays where they sum to 50. I tried this scheme on my computer (which has a fair amount of RAM) and almost gave up, with all 8 CPUs going full tilt and my system fan running flat out for about 3 minutes.
Do you see the latter form is massively memory intensive, while the former is massively looped?
Another approach is to use my own partitions code, found on the FEX, where a recursive search is performed.
```matlab
tic, P6 = partitions(50, 1:45, 6, 6); toc
% Elapsed time is 0.296554 seconds.
size(P6,1)
% ans = 5427
```
Again, there are only 5427 solutions there. It is actually slower than the brute force looped approach, but faster than the massive memory solution.
There are other variations of vectorized solutions to problems, surely more than I can count easily, because people call all sorts of things vectorizations. Sometimes, a solution that uses mathematics is a "vectorization". For example, if you want to form the sum of the numbers 1:n, you might just use the formula, known even to Gauss:
```matlab
N = 100;
N*(N+1)/2
% ans = 5050
```
The point is, vectorized schemes vary widely. They are not always more efficient. Sometimes you may choose a vectorized scheme purely because it looks pretty. I can think of schemes to solve a problem that use tools like intlinprog or lsqlin or backslash, where the solution might then be called a vectorized solution.
What really matters is you understand what the vectorization is doing, and why it may or may not be efficient. Whether it actually gains will often be problem/parameter/size dependent. In the end, you gain when you understand the mathematics, when you understand the limitations of your computer, and when you understand how MATLAB will internally implement the code you write.
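To make the size dependence concrete, here is a small sketch of my own (illustrative only; the crossover point depends on your machine) comparing the implicit and explicit loops for the running-sum example. At small N the vectorized form usually wins; at very large N, the cost of materializing the `1:N` temporary can tip the balance:

```matlab
% Compare sum over a materialized vector vs. an O(1)-memory explicit loop.
% timeit handles warm-up and repetition for us.
N = 1e8;                            % large enough that the temporary matters
tVec  = timeit(@() sum(1:N));       % implicit loop: builds an N-element vector first
tLoop = timeit(@() loopsum(N));     % explicit loop: no temporary array
fprintf('vectorized: %.3g s, loop: %.3g s\n', tVec, tLoop);

function S = loopsum(N)
% Explicit running sum, constant memory.
S = 0;
for ind = 1:N
    S = S + ind;
end
end
```

Running this for several values of N shows where, on a given machine, the memory traffic of the vectorized form starts to outweigh the loop overhead.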

Chunru on 18 Sep 2022
With the improvements to the execution engine and the JIT compiler, newer versions of MATLAB have much better for-loop performance. Very often, for loops no longer slow things down. However, vectorized code is more aligned with the matrix/array thinking that MATLAB promotes, and it usually makes for more concise expressions.
dpb on 18 Sep 2022
Edited: dpb on 18 Sep 2022
I'd guess the extra time is the overhead of the additional function call to sum() here, since the timed function is so small.
There is also an extra allocation, the temporary variable Pi used instead of p, which may or may not affect the time much.
Also, was the timing done with code at the command window or in m-files? There is limited JIT optimization at the command line, so you need the m-file versions.
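Following dpb's point, here is a sketch of a cleaner measurement (the function name `compare_h2` is my own; it assumes `h2_pT` and `h2_pTnew` from the question are saved as m-files on the path). Putting the benchmark in an m-file function lets the execution engine optimize it, and `timeit` replaces the hand-rolled tic/toc loop:

```matlab
function compare_h2()
% Compare per-call cost of the two implementations.
% timeit runs each function handle many times and returns a robust median.
p = 8.7390;
T = 791.6200;
fprintf('loop version:       %.3g s per call\n', timeit(@() h2_pT(p, T)));
fprintf('vectorized version: %.3g s per call\n', timeit(@() h2_pTnew(p, T)));
end
```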