Why is indexing vectors/matrices in MATLAB very inefficient?

This is a rather lengthy question, previously asked on http://stackoverflow.com/questions/13382155/is-indexing-vectors-in-matlab-inefficient. Since I did not get any definitive answer, I decided to ask it here, where it is more likely that MATLAB engineers will look.
Background
My question is motivated by simple observations, which somewhat undermine the beliefs/assumptions often held/made by experienced MATLAB users:
  • MATLAB is very well optimized when it comes to the built-in functions and the fundamental language features, such as indexing vectors and matrices.
  • Loops in MATLAB are slow (despite the JIT) and should generally be avoided if the algorithm can be expressed in a native, 'vectorized' manner.
The bottom line: core MATLAB functionality is efficient and trying to outperform it using MATLAB code is hard, if not impossible.
Investigating performance of vector indexing
The example codes shown below are as fundamental as it gets: I assign a scalar value to all vector entries. First, I allocate an empty vector x:
tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.
Having x I would like to set all its entries to the same value. In practice you would do it differently, e.g., x = value*ones(1e8,1), but the point here is to investigate the performance of vector indexing. The simplest way is to write:
tic; x(:) = 1; toc
Elapsed time is 0.094316 seconds.
Let's call this method 1 (numbered after the value assigned to x). It seems to be very fast (faster, at least, than the memory allocation). Because the only thing I do here is operate on memory, I can estimate the efficiency of this code by calculating the obtained effective memory bandwidth and comparing it to the hardware memory bandwidth of my computer:
eff_bandwidth = numel(x) * 8 bytes per double * 2 / time
In the above, I multiply by 2 because unless SSE streaming is used, setting values in memory requires that the vector is both read from and written to the memory. In the above example:
eff_bandwidth(1) = 1e8*8*2/0.094316 = 17 GB/s
The STREAM-benchmarked memory bandwidth ( https://www.cs.virginia.edu/stream/ ) of my computer is around 17.9 GB/s, so indeed - MATLAB delivers close to peak performance in this case! So far, so good.
Method 1 is suitable if you want to set all vector elements to some value. But if you only want to access every step-th entry, you need to substitute the : with, e.g., 1:step:end. Below is a direct speed comparison with method 1:
tic; x(1:end) = 2; toc
Elapsed time is 0.496476 seconds.
While you would not expect it to perform any differently, method 2 is clearly big trouble: a factor-of-5 slowdown for no apparent reason. My suspicion is that in this case MATLAB explicitly allocates the index vector (1:end). This is somewhat confirmed by using an explicit vector size instead of end:
tic; x(1:1e8) = 3; toc
Elapsed time is 0.482083 seconds.
Methods 2 and 3 perform equally badly.
Another possibility is to explicitly create an index vector id and use it to index x. This gives you the most flexible indexing capabilities. In our case:
tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
Elapsed time is 1.208419 seconds.
Now that is really something - a 12-fold slowdown compared to method 1! I understand it should perform worse than method 1 because of the additional memory used for id, but why is it so much worse than methods 2 and 3?
Let's give the loops a try - as hopeless as it may sound.
tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
Elapsed time is 0.788944 seconds.
A big surprise - a loop beats the 'vectorized' method 4, but is still slower than methods 1, 2 and 3. It turns out that in this particular case you can do even better:
tic;
for i=1:1e8
    x(i) = 6;
end
toc
Elapsed time is 0.321246 seconds.
And that is probably the most bizarre outcome of this study - a loop written in MATLAB significantly outperforms native vector indexing. That should certainly not be so. Note that the JIT'ed loop is still 3 times slower than the theoretical peak almost attained by method 1, so there is still plenty of room for improvement. It is just surprising (a stronger word would be more suitable) that the usual 'vectorized' indexing (1:end) is even slower.
My attempt to get more insights into the problem
I do not have an answer to all the problems, but I do have some refined speculations on methods 2, 3 and 4.
Regarding methods 2 and 3: it does indeed seem that MATLAB allocates memory for the vector indices and fills it with values from 1 to 1e8. To understand this, let's see what is going on. By default, MATLAB uses double as its data type. Allocating the index array takes the same time as allocating x:
tic; x = zeros(1e8,1); toc
Elapsed time is 0.260525 seconds.
For now, the index array contains only zeros. Assigning values to the x vector in an optimal way, as in method 1, takes 0.094316 seconds. Now, the index vector has to be read from memory so that it can be used in indexing. That is an additional 0.094316/2 seconds: recall that in x(:)=1 the vector x has to be both read from and written to memory, so reading alone takes half the time. Assuming this is all that is done in x(1:end)=value, the total time of methods 2 and 3 should be
t = 0.260525+0.094316+0.094316/2 = 0.402
It is almost correct, but not quite. I can only speculate, but filling the index vector with values is probably done as an additional step and takes another 0.094316 seconds. Hence t = 0.4963, which more or less matches the times of methods 2 and 3.
These are only speculations, but they do seem to confirm that MATLAB explicitly creates index vectors when doing native vector indexing. Personally, I consider this to be a performance bug. MATLAB's JIT compiler should be smart enough to understand this trivial construct and convert it to a call to the correct internal function. As it is now, on today's memory-bandwidth-bound architectures indexing performs at around 20% of the theoretical peak.
Regarding method 4, one can try to analyze the performance of the individual steps as follows:
tic;
id = 1:1e8; % colon(1,1e8);
toc
tic
x(id) = 4;
toc
Elapsed time is 0.475243 seconds.
Elapsed time is 0.763450 seconds.
The first step, allocating and filling in the values of the index vector, takes the same time as methods 2 and 3 alone. That seems to be way too much - it should take at most the time needed to allocate the memory and to set the values (0.260525s + 0.094316s = 0.3548s), so there is an additional overhead of 0.12 seconds somewhere, which I cannot explain. The second part (x(id) = 4) also looks very inefficient: it should take the time needed to set the values of x plus the time to read the id vector (0.094316s + 0.094316/2s = 0.1415s), plus some error checks on the id values. Programmed in C, the two steps take:
id = 1:n       0.214259 s
x(id) = 4      0.219768 s
The code used checks that a double index in fact represents an integer, and that it fits the size of x:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static struct timeval tb, te;

void tic()
{
    gettimeofday(&tb, NULL);
}

void toc(const char *idtxt)
{
    long s, u;
    gettimeofday(&te, NULL);
    s = te.tv_sec - tb.tv_sec;
    u = te.tv_usec - tb.tv_usec;
    printf("%-30s%10li.%.6li\n", idtxt,
           (s*1000000+u)/1000000, (s*1000000+u)%1000000);
}

int main(int argc, char *argv[])
{
    double *x  = NULL;
    double *id = NULL;
    int i, n;

    // vector size is a command line parameter
    n = atoi(argv[1]);
    printf("x size %i\n", n);

    // not included in the timing in MATLAB
    x = calloc(n, sizeof(double));
    memset(x, 0, sizeof(double)*n);

    // create the index vector
    tic();
    id = malloc(sizeof(double)*n);
    for(i=0; i<n; i++) id[i] = i;
    toc("id = 1:n");

    // use id to index x and set all entries to 4
    tic();
    for(i=0; i<n; i++) {
        long iid = (long)id[i];
        if(iid>=0 && iid<n && (double)iid==id[i]) {
            x[iid] = 4;
        } else break;
    }
    toc("x(id) = 4");

    return 0;
}
The second step takes longer than the expected 0.1415s - that is due to the error checks on the id values. The overhead of the checks seems too large to me - most likely my code is not optimal and could be written better. Even so, the total time is 0.4340s, not 1.208419s. What MATLAB does under the hood, I have no idea. Maybe it is all necessary; I just don't see it.
Questions
  • is simple indexing in MATLAB very inefficient (methods 2, 3, and 4 are slower than method 1), or did I miss something?
  • why is method 4 (so much) slower than methods 2 and 3?
  • why does using 1e8 instead of numel(x) as a loop bound speed up the code by a factor of 2?
As I see it now, we cannot do anything to improve the situation working within the MATLAB framework. Can we hope these issues will be fixed in some future version of MATLAB? Or are the described performance problems not considered issues at all?
Test code
function test
tic; x = zeros(1,1e8); toc
tic; x(:) = 1; toc
tic; x(1:end) = 2; toc
tic; x(1:1e8) = 3; toc
tic;
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
toc
tic;
for i=1:numel(x)
    x(i) = 5;
end
toc
tic;
for i=1:1e8
    x(i) = 6;
end
toc
end
Edit: The following tests (modifications to methods 1, 2, and 3 using subsasgn) have been suggested by Daniel:
S = struct('type', '()', 'subs', {{':'}});
tic; x = subsasgn(x, S, 1); toc
S = struct('type', '()', 'subs', {{'1:end'}});
tic; x = subsasgn(x, S, 2); toc
S = struct('type', '()', 'subs', {{'1:1e8'}});
tic; x = subsasgn(x, S, 3); toc
Elapsed time is 0.451859 seconds.
Elapsed time is 0.363623 seconds.
Elapsed time is 0.359140 seconds.
Now I am even more lost. Definitely, x(:) performs badly now. But interestingly, methods 2 and 3 work faster than the direct implementation, which suggests that subsasgn is not called directly when running x(1:end) and x(1:1e8). Since my results differ from what Daniel observed, I should mention that I use MATLAB 2011b on Linux/Ubuntu.

3 Comments

Were all of these tests done in functions or at the command line?
Daniel, the code used is included in the question.


Answers (5)

Jan
Jan on 23 Nov 2012
Edited: Jan on 23 Nov 2012
Loops can be faster than vectorized solutions when the allocation of temporary arrays is more expensive than the calculations.
Another difference between id=1:1e8; x(id)=4 and x(1:1e8)=4 is the range check: while in the first case MATLAB checks 1e8 times whether the index is inside the boundaries, in the 2nd case only two tests are performed.
Unfortunately this concerns logical indexing also, or at least it looks like this. A naive C-MEX function, which checks the length of the logical index once, is a noticeable constant factor faster than doing this in MATLAB directly. I'm going to post a version on the FEX.
[EDITED]
Version 1:
id = 1:1e8;
x(id) = 4;
This creates 1:1e8 explicitly. In the assignment, the elements of x and the index vector are stored in the processor caches. In addition, a boundary check is performed for each index (and this requires much more time with my Win64/MSVC2008 or MSVC2010 or OWC1.8 setup than in your measurements). ==> This is slow.
Version 2:
for i = 1:numel(x)
    x(i) = 5;
end
Matlab's JIT accelerates this, and the loop avoids the explicit creation of the index vector. In addition, the (non-existing) index vector does not pollute the caches. It is documented that "numel(x)" is evaluated only once, when the loop starts.
Version 3:
for i = 1:1e8
    x(i) = 5;
end
Obviously the JIT can use the fixed bound to improve the speed compared to version 2. Fine. Perhaps the boundary checks are omitted.
Version 4:
x(:) = 5;
Now Matlab can show its power: because this cannot be affected by any side effects, the JIT calls an efficient method.
For an interpreted language it is not surprising that the other methods do not reach the theoretical memory bandwidth. The power of Matlab is not performance, but fast prototyping. This is useful when the time for programming and debugging exceeds the run time.
Therefore I do not see a performance bug. The JIT is not as powerful as it could theoretically be. But Matlab is a very conservative programming environment. Improving the JIT means modifying it, and this can cause bugs due to the complexity of the task. This concerns other optimizers also, e.g. in C++ compilers: the optimization flags influence the results of large numerical programs. As long as the effects are "small", this is not seen as a bug. But Matlab's JIT tries not to change the results at all, and therefore it has less power.
Btw.:
tic;
id = 1:1e6;
x(id) = 4;
toc % Elapsed time is 0.042795 seconds.
tic;
id = uint32(1):uint32(1e6);
x(id) = 4;
toc % Elapsed time is 0.033133 seconds.
tic;
id = true(1,1e6); % Logical indexing
x(id) = 4;
toc % Elapsed time is 0.025935 seconds.

13 Comments

Jan, thank you for your answer. To simplify things: running x(1:end)=val should perform the same as x(:)=val. In fact, I think that any indexing included in the expression itself, i.e., not using another array id as the indices, should be interpreted by the JIT and transformed into an efficient loop / a call to an internal function. As it is, it seems that MATLAB instead generates an array of indices (a temporary id) and uses it as in the general case. This behaviour can of course be changed - it is not necessary to explicitly generate an array of integers from 1 to end! That I consider to be a performance bug on today's memory-bound architectures.
Checking the index bounds does not introduce a significant overhead - check the C code I have included in my 'speculations'. Although I admit it is there. But this only applies to the most general indexing case with an id array, not to x(1:end) constructs. There you have all the information needed to check the bounds before execution. Instead, MATLAB seems to generate an index vector full of integers, only to later check the bounds on them. Looks like it is biting its own tail.
See [EDITED]. For all tests I ran, checking the boundaries caused a massive overhead. The IF branch impedes the prediction of the processor.
The timings indicate, that in x(1:end) the index vector 1:length(x) is not explicitly created.
Jan, I will have a further look into your observations. Quick comments for now:
For all tests I ran, checking the boundaries caused a massive overhead. The IF branch impedes the prediction of the processor.
First, I have attached C code and results on my computer, which clearly show this is not the case. What MATLAB does in 1.2s, my C code does in 0.4. I agree there can be some overhead, but not that much.
Also, branch prediction works if most of the time the jump is the same. Common case is that the indexes are integers and do fall within bounds, hence there is no problem with branching in my opinion. See my C code and the timings.
The power of Matlab is not the performance, but the fast prototyping.
Again, this is a common view, but I have a different opinion. For me MATLAB is much more than prototyping. I use it for normal production codes. Efficient if you know what you are doing. Easy to program. I do not think that TMW want to limit their client base to people, who write prototypes. I for one want to use it as a serious development tool and sell/deploy the programs. And I do expect performance :) Hence my view of it as a performance bug.
Anyway, TMW do think about performance to some extent. After all, they spent a lot of effort on the Parallel Computing and distributed computing environments. I feel I caught them on a very basic performance problem, which in my opinion could be fixed relatively easily (maybe not - why don't they comment?). Why not stop and think how to significantly speed up the usual, sequential version just by changing the way indexing is implemented?
About changing JIT. I know developing software as large as MATLAB is difficult and changes need to be carefully done. However, it does not mean the software should not adapt to new technology. Research on compilers results in new compilers and the old ones being modified/re-written. Both things happen. Instead of completely re-writing the user interface I would suggest TMW concentrate on performance. If they do not, and someone else comes up with a competing solution, many people might want to reconsider.
I agree with most of that, though I don't quite understand the idea of viewing an opportunity missed for optimization as a "bug". By that reasoning, every time TMW releases a performance improvement, it would have to be considered a bug-fix.
@Matt J. Oh, that's an interesting philosophical discussion :) And I see your point, but I think I could defend my "naming convention" at least a little bit:
  • 20% performance improvement = heavy optimization
  • 80% performance improvement = ?? a) bug fix, or b) re-thinking an old design decision in the face of changing hardware constraints
Also, as we have seen the performance is not consistent for different ways of implementing essentially the same thing. That makes it really difficult for me to have predictable performance. Often enough one has to code all possibilities and see what works best, which is not so nice.
I realize that when the software was developed, memory bandwidth was not such a big bottleneck. It is now, and will only be more so in the future, unless we make breakthroughs in memory speed. When does a missed optimization opportunity become a bug? In a year? In 5 years?
I have no problem with whatever name TMW promotion department would give to this 'potential new feature' in the release notes. As long as it gets done ;) I prefer to call it a bug because I hope that fixing serious bugs gets more priority and attention than a feature request that may be viewed by many as not so crucial.
Jan
Jan on 25 Nov 2012
Edited: Jan on 25 Nov 2012
@angainor: Your performance test using C code with and without boundary tests leads to about a 2.5% runtime difference with your compiler and on your machine. My comparisons with another code on another machine and perhaps another compiler give a higher overhead. Is this surprising? Consider that the branch prediction depends on the number of free threads: on a single-threaded Pentium 4 the performance differs significantly from an i7 with 4 cores and hyper-threading. Therefore posting an incomplete code snippet does not allow discussing "performance" exhaustively.
In my opinion "bug" means a wrong result. Therefore "performance bug" needs a different level of definition, e.g. the computations would be correct, but the machine breaks down during the computations due to erosion. But to become serious again: my professor for numerics has 40 years of programming experience. A performance jump of factor 5 cannot impress him, because waiting some years and running the same code on a cheap but modern PC has the same effect.
@Jan I have updated the question with the full source file. If you use windows you will need to reimplement tic/toc - I use gettimeofday.
Your performance test by using C-code with and without boundary tests lead to about 2.5%
I am afraid you misunderstood my C test. First timing is for memory allocation and setting the id vector values, the second is using id vector to index x, including checking the bounds and double to int cast. I do not compare those numbers - they are similar by coincidence. Also, 2.5% is not 500%. First one is not surprising, second one is.
What I compare in my C code is exactly the functionality of method 4 in matlab:
  • first, I set the values of the id vector of doubles to 1:n, including the memory allocation of id
  • next, I use the id vector to index x, including bound checking and type conversion from double to int, to set all x entries to 4
That is a fair comparison. I use gcc-4.4.6 with -O3 flag to compile and run on Ubuntu 11.10. I will gladly run your code and look at the results.
Both the MATLAB and the C version were run on the same i7-based computer. I only use 1 CPU core, so hyper-threading/number of cores does not matter. MATLAB's version is 4 times slower, and I still claim my version is not the best, because the memory bandwidth analysis shows me it is not optimal. I'll try to come up with a better one. Finding a worse one, or finding a worse compiler, only proves that things can be implemented badly, and that they can be optimized.
Of course, a bug is when things do not work. That's why I called it a performance bug. I do not know if it has a formal definition, but here are a few links - others also use this term to denote performance problems that real users actually notice. I did :)
But fine, I can change the term I use. How do you suggest I report it to TMW so that they consider it as a problem and improve it? I know I can report bugs.
@angainor: Please do not get me wrong: I agree that the performance of MATLAB is not optimal, and if a user has certain expectations it can even be called "not satisfactory". As you can see in my FEX submissions, I frequently use C to work around limitations in MATLAB. E.g. the trivial anyEq replaces the MATLAB command any(X==y): for large X this takes half of the time used by MATLAB in the worst case (no matching element), because creating the temporary logical vector X==y consumes resources. In the best case, when the first element matches already, the time required by the C code does not depend on the size of the problem, such that it is 2600 times faster for a 1x1e7 input. It is very sad that MATLAB's JIT cannot avoid the creation of temporary arrays when it is obvious that the user does not actually want to create them. But currently the power of the JIT is limited.
Therefore I try again to express myself more clearly: yes, you observed a frequently mentioned problem. TMW has invented the JIT to handle this problem, and in some cases it works efficiently already. In some cases the JIT introduced real bugs and the users were very upset. I can relinquish a speedup in one assignment procedure, which lets the total program run some percent faster, when it costs 2 months of debugging.
Like you, I dream of a much more powerful JIT, and e.g. the browser manufacturers have proven that interpreted JavaScript has enormous potential for improvement also. Of course some C code can be very much faster for a specific problem, and no doubt I would be very happy if you send an enhancement report and TMW implements the corresponding improvements (successfully...).
And until this happens, I will use the already available features to run MATLAB as efficiently as possible.
It is hard for me to find "method 4" in your question. I'm not sure I understand the purpose of the posted C code completely. But finally, I would not be surprised if a hard-coded method programmed in a low-level language is faster than the MATLAB version, which can also work with singles, integers, imaginary values and user-defined objects. A hand-optimized assembler version using SSE4 will be even faster.
@Jan, I'm still wondering if what we're discussing here is attributable to a "frequently mentioned problem" as you put it. I assume that you meant by that the need to create temporary arrays. But that doesn't seem to explain it to me. I'm still seeing a significant difference between
tic; x(1:1e8)=2; toc
%Elapsed time is 0.564816 seconds.
and
idx=1:1e8;
tic; x(idx)=2; toc
%Elapsed time is 0.807149 seconds.
The first version is still much faster even though it is handicapped by the creation of a temporary array. (I assume we all now agree that the first version does create a temporary index array). How do we explain this? Caching?
@Matt J: Daniel, angainor and I claim that "x(1:1e8) does not create 1:1e8 explicitly". As far as I understood, angainor asks why "x(1:1e8)=4" is so much slower than "x=zeros(1,1e8); x(:)=4;".
About the "frequently mentioned problem": see the already suggested http://undocumentedmatlab.com/blog/preallocation-performance/. MATLAB performs fewer operations in-place than theoretically possible.
@Jan: I haven't seen angainor join you and Daniel in that claim. In fact, we mentioned tests (in Comments to Daniel under my Answer) that seem to refute it. I definitely see x(1:1e8)=2 producing a memory allocation spike in the Task Manager, even when x is already pre-allocated.
And I'm still not seeing the connection between the undocumentedmatlab material and this issue. Some operations are not done inplace, but is there a theory that subsasgn operations are not?
@Matt J: I've copied the wrong link. Actually I meant http://undocumentedmatlab.com/blog/matrix-processing-performance/.
I see that there are arguments for both theories about the (avoided) explicit allocation in "x(1:1e8)". All I've heard about the JIT lets me think that it depends on the specific code. E.g. it can matter whether the code runs at the command line or in a function. Therefore I still think that this discussion is theoretical only, not to say fruitless. We cannot influence the JIT directly, the JIT differs from release to release, and TMW has to consider thousands of different aspects when modifying it. And finally, filling memory with a scalar is most likely not a bottleneck in any real-world program, because this redundant information could be handled much more efficiently.
@angainor: Please lead me back to the actual point: How can we help you or what is the core of your message?
@MattJ @Jan To clarify, I do see memory being allocated when running x(1:1e8)
Please lead me back to the actual point: How can we help you or what is the core of your message?
My initial intention was to (maybe) get an answer from an engineer at TMW on the issue. I think I have managed to show that indexing is inefficient (5 times overhead over a C loop is, in my opinion, not reasonable). What I wanted to know is:
  • is MATLAB indeed implementing loops by explicitly generating integer indices and storing them as doubles?
  • is it a bug that using numel(x) as a loop bound instead of 1e8 slows down the code by a factor of 2? Should I report it?
  • how should I report the indexing problem as a whole to TMW so that they seriously consider it?
I have some prior experience with the tech support answering my questions, and let's just say sometimes the questions were too technical to be answered just by any person. This one definitely is a tough one, both technically and from the point of view of future MATLAB design decisions, so I suspect I will have trouble getting it through the technical support. I guess I will do that anyway, now that it is well documented here.


Never mind any of that. Try explaining this:
x = zeros(1e8,1);
id = 1:1e8;
tic;
for i=id
    x(i) = 4;
end
toc
%Elapsed time is 101.974125 seconds.
I think the bottom line is this. Vectorized methods are generally going to be slow because they always require you to generate and allocate memory for a dedicated index array. Some vectorized situations are accelerated more than others, however, like
x(1:1e8)=2
because MATLAB recognizes that 1:1e8 is only temporarily required and does a better job of caching it.
For-loops can in some cases do better than vectorized methods because the JIT can recognize certain patterns and situations where the loop is jumping in regular increments and doesn't need to allocate memory for a dedicated index array.
As for why numel(x) slows a for-loop down, I imagine the situation is similar to C/C++. At every iteration of the loop, i must be compared to numel(x), which is stored in some temporary variable, to check whether the loop should terminate. Variables are not cached as well as constants like 1e8.

11 Comments

Vectorized methods are generally going to be slow because they always require you to generate and allocate memory for a dedicated index array.
I am happy we agree. AFAIK, this is not a documented fact, but my speculation based on performance analysis.
MATLAB recognizes that 1:1e8 is only temporarily required and does a better job of caching it.
A vector of integers is not needed to implement a loop. That is my point. x(:)=val should perform the same as x(1:end)=val. JIT should analyze such constructs and execute a correct internal function for them. If it does not do that, it may be a valuable suggestion to TMW to implement such functionality. Most codes use vectorized array/matrix indexing. Improving the performance of those by factor 5 could benefit many people...
As for why numel(x) slows a for-loop down, I imagine the situation is similar to C/C++.
It is not. Using runtime-known loop bounds does not introduce any significant overhead with modern optimizing compilers. You establish the loop bound once, before the loop executes. The compilers do automatic loop unrolling and other transformations if needed.
I am happy we agree. AFAIK, this is not a documented fact, but my speculation based on performance analysis.
Well, it's reasonably well documented that colon-ed expressions normally generate an explicit array, and cases where they don't are exceptions rather than the rule. I guess we're discovering, based on further Comments below, that the exceptions include JIT-accelerated for-loops and indexing expressions for built-in data types.
Not all indexing expressions are exceptions, however. For user-defined data types, an explicit index array does get created and gets passed to subsref/subsasgn in a form generated by SUBSTRUCT.
In the loop numel(x) is fixed at the start of the loop.
x = ones(1, 3);
for ii = 1:numel(x)
    x = [x, 1]
end
numel(x)
MATLAB is not making a temporary variable with x(1:1e8)=2. On my poor little Windows box
x = zeros(1e8, 1);
y = zeros(1e8, 1);
??? Out of memory. Type HELP MEMORY for your options.
but
x(1:1e8) = 2;
works just fine...
It is not. Using runtime-known loop bounds does not introduce any significant overhead on modern optimizing compilers. You establish the loop bound once, before loop execution. The compilers do automatic loop unrolling and other transformations, if needed.
I used to find that this loop
for (i=m; i>=0; i--)
would run faster in C than when it was run forward
for (i=0; i<=m; i++)
and the explanation given to me was that the condition i>=0 was quicker to evaluate repeatedly, since it involves only a comparison with a constant, whereas i<=m involves a comparison with another variable. Was I just using a bad compiler, then?
Jan
Jan on 23 Nov 2012
Edited: Jan on 23 Nov 2012
@Matt J: The comparison "i <= m" is more expensive than "i >= 0", which can be done directly by a processor instruction. But this concerns only the comparison with 0, while "i >= 17" cannot profit.
@Daniel: I confirm this. x(1:1e8) does not create 1:1e8 explicitly.
This is just idle speculation, but in C/C++, is pointer incrementation of the form p++ faster in a loop than p=p+a?
If so, maybe the C/C++ implementation of x(:)=val uses p++ incrementation while x(1:a:n)=val is implemented with p=p+a incrementation, even when a=1, and maybe that accounts for the difference in loop speed...
@MattJ Was I just using a bad compiler, then?
I do not know exactly what case you have in mind. I am saying that this loop
for(i=0; i<1e8; i++)
and this loop
for(i=0; i<n; i++)
where n is only known at runtime have the same efficiency in C/C++ for any decent optimizing compiler.
If so, maybe the C/C++ implementation of x(:)=val uses p++ incrementation while x(1:a:n)=val
I am discussing a 5-fold overhead over just writing the memory. The time to manage the loop is insignificant on modern memory-bandwidth-bound architectures compared to the time it takes to read/write the memory, i.e., the instructions take ~no time since they are overlapped with memory transfers. The problem is not just the way you write the loop, or even the range checks/double-to-int cast, as I have shown in my C code. This is something much more fundamental.
I guess we're discovering based on further Comments below that exceptions include JIT-accelerated for-loops and indexing expressions for built-in data types.
Great. My wish/dream/expectation as a customer is that TMW optimizes native matrix/vector indexing. It can be done as an exception, if necessary :) Since some commentators claim the index vectors are not explicitly created (I am not convinced yet, but maybe), I would like TMW to eliminate the large (5 times!) overhead over what is strictly necessary. x(:) performance is optimal, hence it is possible and there is no inherent problem with MATLAB - it can do things well.
@Jan @Daniel : I confirm this. x(1:1e8) does not create 1:1e8 explicitly.
How did you check that? I do not confirm it. I just executed the statements while looking at MATLAB's memory consumption and see that the memory usage clearly goes up by almost 800 MB. Which MATLAB version are you using?
I just executed the statements while looking at MATLAB's memory consumption and see that the memory usage clearly goes up by almost 800 MB.
Confirmed! (in R2012a 64-bit)
Well I guess the plot thickens.
performance is optimal, hence it is possible and there is no inherent problem with MATLAB - it can do things well.
Yes, it's possible, and maybe it's on their TODO list, but the question is how much development work it would take TMW to make the optimization you're talking about. Basically, what you seem to be proving is that all indexing operations, even on native matrices/vectors, generate index vectors and pass those vectors to SUBSASGN - except, of course, in the particular case where only the string ':' is passed. The reason it is possible to optimize x(:) is that when the ':' string gets passed to SUBSASGN, it can easily be recognized as a special case. What you're now asking is that the JIT recognize colon expressions separately and pass a new/different kind of symbol to SUBSASGN to encode them. It's not clear how much development effort that would require, or how much extra burden it would put on the JIT.
There are also many other suboptimalities you could point out to TMW. What about the following performance difference, for example? Does it make sense that something only slightly different from x(:) should be about a factor of 2 slower?
tic;
x(:)=2;
toc;
%Elapsed time is 0.054764 seconds.
tic;
x(:,1)=3;
toc;
%Elapsed time is 0.098193 seconds.
@Matt J: Does it make sense that something only slightly different from x(:) should be about a factor of 2 slower?
No :) Hence my concern - indexing in MATLAB should in my opinion be re-designed and re-implemented. My question tried to point this out with the simplest possible example. I hope it is not on a TODO list, but already in their working SVN branch ;)
About effort - why not put it here instead of re-writing the user interface, which was fine as it was? It is my right to complain as a customer :)
But really, I do think that indexing is so fundamental that it needs some serious attention.

MATLAB is not as simple as you suggest. You asked about the extra read and write overhead you observe with x(1:end) and x(1:1e8). I would suggest that this overhead comes from subsasgn, and I would add two additional tests to your list. According to the documentation, x(:) should be equivalent to
S = struct('type', '()', 'subs', {{':'}});
tic; x = subsasgn(x, S, 4); toc
and the other three examples should be equivalent to
S = struct('type', '()', 'subs', {{1:1e8}});
tic; x = subsasgn(x, S, 5); toc
On my machine
S = struct('type', '()', 'subs', {{':'}});
tic; x = subsasgn(x, S, 4); toc
is slower than
x(:) = 1;
and
S = struct('type', '()', 'subs', {{1:1e8}});
tic; x = subsasgn(x, S, 5); toc
is slower than
x(1:end) = 2;
and
x(1:1e8) = 3;
but the same speed as
id = 1:1e8; % colon(1,1e8);
x(id) = 4;
This suggests that MATLAB speeds up x(:), x(1:end) and x(1:1e8), possibly by avoiding the overhead of subsasgn. I think this is probably the case for x(:). For x(1:end) and x(1:1e8), however, I think subsasgn is being used, since
S = struct('type', '()', 'subs', {{':'}});
tic; x = subsasgn(x, S, 4); toc
takes about the same time as both
x(1:end) = 2;
and
x(1:1e8) = 3;
suggesting some sort of minimum overhead on the subsasgn route. This suggests to me that MATLAB is not calling subsasgn the way the documentation says. That is not surprising, since subsasgn is a tricky beast in that it cannot be overloaded, and the JIT helps where it can.
As for the loops, who knows what the JIT is doing. Some things are optimized and some things are not. Some things that were optimized in one version become un-optimized in another version.

3 Comments

If x(1:1e8) is not creating a temporary index array 1:1e8, as you argued in your Comment to me, then I don't see how it could be calling subsasgn. If it's calling subsasgn, what 'subs' data is being passed to it? You mean it's translating x(1:1e8) into x(:) and then calling SUBSASGN with subs=':'? In my timing results it's taking a really long time to do that translation, on the order of .01 sec.
Incidentally, I'm not clear what you meant by "subsasgn is a tricky beast in that it cannot be overloaded", since people overload it all the time when writing classdefs.
Yes, subsasgn can be (and often is) overloaded for custom classes, but it cannot be overloaded for built-in classes. According to the documentation: "Therefore, if A is an array of class double, and there is an @double/subsasgn method on your MATLAB path, the statement A(I) = B calls the MATLAB built-in subsasgn function."
As for x(1:1e8) creating a temporary variable, I am looking into that now. I don't know what the input to subsasgn is, but it doesn't seem to require much memory, and it appears to be able to handle such transformations (e.g., 1:M:N) without creating a temporary variable ...
Daniel, thanks a lot for the new tests and your observations. I have updated my question with new results. As you will see, they are not like yours. In my case subsasgn was faster than x(1:end) and x(1:1e8), but much slower than x(:) - here the time is the same as the original methods 2 and 3 (x(1:end)= and x(1:1e8)=).
More confused.
About not calling subsasgn. I would love that. That is all I want. The matrix/vector indexing to be efficient as in the original x(:) case.
As a side note: looking at MATLAB's memory usage in the system monitor, I see that the memory consumption goes up by ~800 MB when I execute x(1:end)=val.

Some tests revealed that even logical indexing can be implemented more efficiently: FEX: CopyMask. This C-MEX function is about 2 to 3 times faster than MATLAB's built-in method:
function test
X = rand(8000, 8000);
tic
for k = 1:10
    Y = CopyMask(X, X > 0.2);
end
toc
tic
for k = 1:10
    Y = X(X > 0.2);
end
toc
The fact that different means of indexing take different amounts of time is interesting, but probably not that useful to understand in depth. The timings appear to depend on the MATLAB version and possibly on the OS. In my opinion, optimizing MATLAB for
x(1:n) = a;
would be a big mistake. I would much rather have MATLAB optimized for something like
x(randi(n, n/100, 1)) = randn(n/100, 1);
which I don't think is the same problem.

1 Comment

You are of course right in that it is a simplistic example, maybe not that useful. But I often use constructs like these:
x(1:end-1,:)
x(:,1:2:end,:)-x(:,2:2:end,:)
x(1:5,:)
x(x>0) = val
and so on. And apparently I should be very careful not to use this
x(1:end)
which is silly, but I will not complain about it too much.
Your example with random indices is also not very realistic. But apart from the overhead of the random number generation and the terribly non-local memory access pattern that kills performance due to cache misses (that is the user's fault, not TMW's), the indexing itself is exactly what I tested in method 4:
id = 1:numel(x);
x(id) = 1;
and that turned out to be (at least compared to my C code) 4 times too slow.
I only complain about it because indexing is a fundamental feature that I am not able to improve upon while staying within the MATLAB framework. I do not complain about an inefficient sparse function - I go and implement it in my own MEX file. I cannot do that here, and so I am not able to deliver the performance I am expected to deliver with MATLAB. And I would like to.

Asked on 22 Nov 2012
Answered by Jan on 13 Apr 2015
