Pre-allocating Arrays and Optimizing Memory

I am working with some large data sets. I could have up to 16 (1000000 x 8) csv files.
I import with textscan, and immediately remove unused columns by doing:
for j=1:length(data)
data{j}{2}=[];
data{j}{5}=[];
% etc...
end
The columns are empty, but "still there". Is there a way to actually remove them? I assume they still take up space, since even an empty array I create has a nonzero byte size.
Also, does it make sense to pre-allocate for the maximum possible size by doing:
for j=1:16
data{j}{1000000,8}=[];
end
When I pre-allocate in that manner, the empty array is larger in bytesize than when I actually load the data file.

Accepted Answer

per isakson on 25 Jan 2013
Edited: per isakson on 25 Jan 2013
Are you aware that textscan can skip columns when reading? For example, the format
'%f%*f%f'
reads the first and third columns and skips the second.
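As a minimal sketch of the skip syntax (the inline sample text and the comma delimiter are assumptions for illustration, not the poster's actual files), textscan can parse a character vector directly:

```matlab
% Hypothetical sample: two rows of three comma-separated numbers.
txt = sprintf('1,2,3\n4,5,6\n');

% '%*f' reads a numeric field and discards it, so no output cell is created for it.
C = textscan(txt, '%f%*f%f', 'Delimiter', ',');

col1 = C{1};   % first column:  [1; 4]
col3 = C{2};   % third column:  [3; 6]
```

Because the skipped field never reaches the output, there is nothing to delete afterwards.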
.
Yes, assigning [] with parentheses indexing, as in "m(:,3)=[];", deletes columns and frees memory:
>> m = magic(4);
>> m(:,3)= [];
>> whos m
Name Size Bytes Class Attributes
m 4x3 96 double
>>
.
Pre-allocation cannot be done with "=[];". I mostly use
m = nan( rr,cc);
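A short sketch of that pattern, with rr, cc and the loop body chosen purely for illustration (none of these values come from the question):

```matlab
rr = 5;  cc = 3;           % illustrative sizes (assumed)
m  = nan(rr, cc);          % allocate the full matrix once, filled with NaN

for k = 1:rr
    m(k, :) = k * (1:cc);  % fill row by row; m never has to grow
end
```

Filling with NaN rather than zeros makes any row that was never written easy to spot later.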
.
Pre-allocation doesn't make much sense in your current case.

More Answers (2)

James Tursa on 5 Feb 2013
Edited: James Tursa on 5 Feb 2013
Some clarification on the use of cell arrays and []:
A cell array is an array of variable pointers. That is, the "data" of a cell array are pointers to other MATLAB variables. For the sake of the following discussion, assume we are talking about a 32-bit system so the pointers are 4-bytes each.
Consider the following statement:
C = cell(3,2)
C =
[] []
[] []
[] []
The above creates a variable C with 6 elements. Each element is a 4-byte pointer that is initialized to NULL (i.e., 0). They aren't pointing to anything yet.
Now consider this follow-on statement:
C{3,2} = []
C =
[] []
[] []
[] []
It doesn't look like anything in C has changed, but something has. The [] on the right-hand-side represents an empty double matrix. The statement above will put a shared data copy of right-hand-side into the (3,2) position of the cell array C (in point of fact, since there is no data to share you will end up with an unshared variable for this particular case). So C now in fact has 5 NULL values and one non-NULL value for elements, the latter being an empty double matrix and taking up memory (even though it prints to the screen as [] just like the actual NULL values).
Suppose we do this in a loop:
for k=1:6; C{k} = []; end
C =
[] []
[] []
[] []
Again it looks like nothing has changed for C, but again it certainly has behind the scenes. Each element of C now contains an empty double matrix. That is, each element of C contains a non-NULL pointer that points to a MATLAB variable that happens to be empty. You now have 6 extra variable headers taking up memory, even though it prints like the original when it was first created.
In all the cases above I am setting cell element values, not clearing them, because I used the curly braces { } on the left-hand-side.
Now suppose we do this:
C(:) = {[]}
C =
[] []
[] []
[] []
Again it looks like nothing has changed, but of course it has. In this case, however, there is only one extra variable header taking up memory. All of the pointer elements of C are EXACTLY THE SAME. For cell arrays (and structs) MATLAB has the ability to make reference copies of variables to use for elements as long as you do it in one fell swoop. MATLAB will keep track of this behind the scenes, but the bottom line is that all of the pointer values in C are exactly the same and there is only one physical empty double matrix in existence. This is an efficient way to wipe out the variable contents of a cell array without taking up much memory in the process. There is no way I know of at the m-file level to get all the elements back to a NULL state, btw, although it could be done in a mex routine (separate topic).
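The three ways of ending up with an "all []" cell array discussed so far can be put side by side. This is only a sketch: the internal difference (NULL pointers, six separate headers, or one shared header) is not something m-code can observe, and the byte count reported by whos may not reflect sharing, so the check below only confirms that the three results look identical from the m-file level.

```matlab
C1 = cell(3,2);              % elements never assigned

C2 = cell(3,2);
for k = 1:numel(C2)
    C2{k} = [];              % one assignment per element
end

C3 = cell(3,2);
C3(:) = {[]};                % a single assignment covering all elements

% From m-code, every element reads back as an empty double in all three cases.
allEmpty = @(C) all(cellfun(@(x) isa(x, 'double') && isempty(x), C(:)));
same = allEmpty(C1) && allEmpty(C2) && allEmpty(C3);
```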
Now finally consider this syntax:
C(:,2) = []
C =
[]
[]
[]
Now we are finally getting to the syntax that physically deletes elements of C ... in this case the 2nd column. This is a special syntax in MATLAB for physically deleting elements of a variable and shrinking its size. This last syntax works for any variable, not just cell arrays.
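A small sketch contrasting the deletion syntax with the curly-brace assignment (variable names are illustrative):

```matlab
C = cell(3,2);
C(:,2) = [];          % parentheses with []: deletes the second column of cells
sizeC = size(C);      % 3-by-1

m = magic(4);
m(:,3) = [];          % same syntax on a numeric matrix
sizeM = size(m);      % 4-by-3

D = cell(3,2);
D{3,2} = [];          % curly braces: sets the content of one cell
sizeD = size(D);      % still 3-by-2
```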
Given all of the above discussion, now consider Jared's proposed code:
for j=1:length(data)
data{j}{2}=[];
data{j}{5}=[];
end
The above will replace the data{j}{2} element with an empty double matrix, and replace the data{j}{5} element with an empty double matrix. There will still be two variable headers in memory for these locations.
And for the following:
for j=1:16
data{j}{1000000,8}=[];
end
the data{j}{1000000,8} element will be replaced with an empty double matrix (complete with variable header retained in memory).
The only way to save on memory for this header storage would be to get all of the indexes together in a separate variable and then use that to replace them all with [] in one fell swoop. That would allow MATLAB to make reference copies of the [] matrix and you would be left with only a single physical empty double matrix in memory (all the other elements would be reference copies of this and take up 0 additional memory).
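A sketch of the "one fell swoop" replacement, using a made-up 1-by-8 cell as a stand-in for one textscan result (the name block and its contents are hypothetical):

```matlab
% Hypothetical stand-in for one textscan result: a 1-by-8 cell of columns.
block = repmat({rand(10,1)}, 1, 8);

idx = [2 5];          % all unwanted column indexes, gathered first
block(idx) = {[]};    % one assignment; the cell keeps its 1-by-8 size
```

The cells at idx still exist afterwards; only their contents have been replaced.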
Final Comment: Except in very special circumstances (such as downstream in-place operations on variables), it is almost never a good practice to "initialize" cell array elements with [].
  1 Comment
James Tursa on 5 Feb 2013
I mentioned above that, to my knowledge, the only way to NULL out a cell array element is in a mex routine. I should add that doing this risks unintended side effects if the cell array itself is shared, and the only way to detect sharing is by hacking into the variable header and using undocumented API functions.



Evan on 4 Feb 2013
Use parentheses ( ), not curly braces { }, to remove a row, column, or individual cell from a cell array.
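Applied to the question's loop, this might look like the sketch below. The data variable here is a hypothetical stand-in (16 files, each a 1-by-8 cell of columns, which is the shape textscan returns for an eight-field format); only the indexes 2 and 5 come from the question.

```matlab
% Hypothetical stand-in for the question's variable.
data = cell(1,16);
for j = 1:numel(data)
    data{j} = repmat({zeros(10,1)}, 1, 8);
end

unused = [2 5];               % columns to drop (from the question)
for j = 1:numel(data)
    data{j}(unused) = [];     % parentheses: removes the cells themselves
end
```

Deleting both indexes in a single indexing operation avoids the shift that would occur if column 2 were removed first and column 5 afterwards.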
