Handling large arrays - fragment into vectors?

Cheers everyone,
I am working on a MATLAB code which routinely has the nasty habit of running out of memory. Here is why: I currently use a 3-dimensional array (m,n,k) for a 2-dimensional (m,n) matrix. Each cell of the 2-D array contains a vector of information k (which is procedurally generated) that can range in size somewhere between 1 and up to several million values. The resulting matrix would be massive, and significantly too large for MATLAB to handle. The vectors for k are processed one after the other, so I do not need to perform any matrix operations on the full 3-D array.
I am therefore looking for a way to handle this massive amount of data without running out of memory. I have already tried sparse matrices to account for the discrepancy in the length of the k vectors, but they did not solve the issue either. I considered saving the k vectors as ASCII files on the hard drive and loading them into the script as needed, but saving them this way would create up to 400,000 files. Alternatively, there seems to be a way to procedurally generate variable names (http://www.mit.edu/~pwb/cssm/matlab-faq_4.html#evalcell), but using this also isn't recommended.
Do you know of a better way to handle data of this size?

Accepted Answer

Guillaume on 1 Mar 2017
Edited: Guillaume on 2 Mar 2017
I agree with Jan, the proper way to store your data in memory would be a cell array. I'd use a kx3 cell array, where columns 1 and 2 are the row and column indices respectively, and column 3 is your vector.
I also agree with Jan that adding more memory might be the most efficient solution. Failing that, you would have to use file(s). Matlab has some useful tools to handle big data with the datastore classes. Unfortunately, as far as I can tell, none of them are designed to work with variable-length vectors.
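To make the in-memory layout concrete, here is a minimal sketch (generate_entry and process_vector are just placeholders for your procedural generation and per-vector processing, not anything from your code):

%build a kx3 cell array: row index, column index, generated vector (placeholder names)
data = cell(0, 3);
for entry = 1:numentries               %numentries = however many (m,n) cells actually hold a vector
    [m, n, v] = generate_entry(entry); %placeholder for the procedural generation
    data(end+1, :) = {m, n, v};        %#ok<AGROW> each vector keeps only its actual length
end
%process one vector at a time, never forming the full (m,n,k) block
for row = 1:size(data, 1)
    process_vector(data{row, 1}, data{row, 2}, data{row, 3}); %placeholder per-vector processing
end

Each cell stores its vector at its true length, and the cells don't need one contiguous block of memory.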
So, I would use low-level I/O to write and read a single file, storing each row of the cell array as binary, since parsing text is going to slow the code down for little benefit. For faster seeking, I would write the offset to each vector as a header. Something like:
function array2file(filepath, carray)
    validateattributes(carray, {'cell'}, {'2d', 'ncols', 3}); %must be a cell array with 3 columns
    fid = fopen(filepath, 'w');
    %byte size of each record: row and col (2 doubles), vector length (1 uint64), then the vector values (doubles)
    recordsizes = 8 * (3 + cellfun(@numel, carray(:, 3)));
    %byte offset of each record, relative to the end of the header (first record at offset 0)
    offsets = [0; cumsum(recordsizes(1:end-1))];
    fwrite(fid, size(carray, 1), 'uint64'); %header starts with the number of rows of the cell array. uint64 may be a bit overkill for storage.
    fwrite(fid, offsets, 'uint64');
    for row = 1 : size(carray, 1)
        fwrite(fid, [carray{row, [1 2]}], 'double');
        fwrite(fid, numel(carray{row, 3}), 'uint64');
        fwrite(fid, carray{row, 3}, 'double');
    end
    fclose(fid);
end
function [rowidx, colidx, v] = readindex(filepath, index)
    %specify index 0 to get the number of records in the file:
    %  numrecords = readindex(filepath, 0);
    %[rowidx, colidx, v] = readindex(filepath, index) %with index > 0 && index <= numrecords returned by the previous call
    fid = fopen(filepath, 'r');
    numindices = fread(fid, 1, 'uint64');
    if index == 0
        rowidx = numindices;
    else
        validateattributes(index, {'numeric'}, {'positive', 'integer', '<=', numindices});
        offsets = fread(fid, numindices, 'uint64');
        fseek(fid, offsets(index), 'cof'); %offsets are relative to the end of the header, hence seeking from the current position
        rowidx = fread(fid, 1, 'double');
        colidx = fread(fid, 1, 'double');
        nelem = fread(fid, 1, 'uint64');
        v = fread(fid, [1, nelem], 'double');
    end
    fclose(fid);
end
edit: made a mistake in the read function
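A quick usage sketch (the file name and example data below are purely illustrative):

%illustrative usage only
carray = {1, 1, rand(1, 5); 2, 3, rand(1, 12)}; %tiny 2x3 cell array: row, col, vector
array2file('vectors.bin', carray);              %write everything to a single binary file
numrecords = readindex('vectors.bin', 0);       %ask how many records the file holds
[r, c, v] = readindex('vectors.bin', 2);        %read back the second vector with its row/col indices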
  2 Comments
Max on 1 Mar 2017
Thank you for your response. What you suggested sounds very interesting, and especially the reference to Datastore may prove exceptionally useful. This may take some tinkering...
Guillaume on 2 Mar 2017
Well, as said, I couldn't find any of the datastore classes that fit your requirements, so saving to a file is probably best.
The draft code I wrote above is probably not exactly suitable as it assumes that the data can be written to file in one go. If it can be written to file in one go, it means that it can fit into memory, so there's little point to it. A slightly different file structure (offsets stored at the end) would allow you to write the file as vectors are generated.
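As a rough sketch of that alternative layout (illustrative only, and generate_entry is again a placeholder): append each record as it is generated, remember where it starts, and write the offset table at the end of the file, patching its position into a placeholder at the start:

%incremental writing with the offset table at the end (sketch only)
fid = fopen('vectors.bin', 'w');
fwrite(fid, 0, 'uint64');                %placeholder for the position of the offset table
offsets = zeros(numentries, 1);          %here numentries is assumed to be known up front
for entry = 1:numentries
    [m, n, v] = generate_entry(entry);   %placeholder for the procedural generation
    offsets(entry) = ftell(fid);         %absolute byte position where this record starts
    fwrite(fid, [m n], 'double');
    fwrite(fid, numel(v), 'uint64');
    fwrite(fid, v, 'double');
end
tablepos = ftell(fid);                   %the offset table goes after the last record
fwrite(fid, numentries, 'uint64');
fwrite(fid, offsets, 'uint64');
fseek(fid, 0, 'bof');
fwrite(fid, tablepos, 'uint64');         %patch the placeholder so a reader can find the table
fclose(fid);

A matching reader would first read the table position from the start of the file, fseek there to read the table, and then seek directly to whichever record it needs.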


More Answers (1)

Jan on 1 Mar 2017
The detail "3-dimensional array (m,n,k) for a 2-dimensional (m,n) matrix. Each cell of the 2-D array contains a vector of information k" is not clear to me. But I guess a {m x n} cell array would be more efficient, because it does not require a massive block of contiguous memory. But it has the drawback that Matlab needs about 100 bytes of overhead per cell element.
Please mention the typical dimensions. "Too large for MATLAB to handle" is not precise enough to estimate how to process the data efficiently. Storing the data on the hard disk as ASCII is a really bad idea. At least write them in binary format. But disks are slow. Better install more RAM.
  1 Comment
Max on 1 Mar 2017
Thank you for your response. My full matrix will be something along the lines of 600x600, with vectors for the 'third dimension' going into the millions in length.
Also thank you for the tip with the cells, this may indeed be exactly what I am looking for.
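For a rough sense of scale (back-of-the-envelope numbers only, using the sizes above):

%rough memory estimate, illustrative only
m = 600; n = 600;
celloverhead_MB = m * n * 100 / 2^20;      %~100 bytes per cell element -> about 34 MB of overhead
onevector_MB    = 1e6 * 8 / 2^20;          %one vector of a million doubles -> about 8 MB
fullarray_GB    = m * n * 1e6 * 8 / 2^30;  %a dense 600x600x1e6 double array -> roughly 2700 GB

So the cell overhead is negligible; it is the dense (m,n,k) array that is hopeless, while per-vector storage may or may not fit in RAM depending on how many cells really hold millions of values.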

