How to write data to a binary file at a specific position?

Hello,
Let us say that my data looks like this -
data = [1,1,1,1,1; ...
        2,2,2,2,2; ...
        3,3,3,3,3];
I would like to write this data to a binary file so that it looks like [1;2;3;1;2;3;1;2;3;1;2;3 ... and so on].
For a small file, I can easily do this with fwrite(fp, data(:), 'int16'). However, for a very large data file (where the data size is 100*1e10 elements or more), this becomes extraordinarily slow. The raw data is stored in separate files, one for each row, so I can read the data row by row. So, is it possible to write data to a binary file at a specific position?
Thank you for help!
  6 Comments
Jan
Jan on 25 Mar 2022
Edited: Jan on 25 Mar 2022
Why do you use dummy data if you have created some test data before?
What is the purpose of this code:
output_data = zeros(nrows*rowsize, 1);
for i = 1:nrows
    this_row = data(i, :); % This is meant, isn't it?
    output_data(i:nrows:end) = this_row;
end
It is an expensive version of:
output_data = data(:);
But you have written this line already, so I do not understand what the second block of code should demonstrate. Simply omit the expensive loop.
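A quick check (an illustrative aside added here, not part of the original comment) that the loop and data(:) produce the same vector:
nrows = 3; rowsize = 5;
data = reshape(1:nrows*rowsize, nrows, rowsize);   % small example matrix
output_data = zeros(nrows*rowsize, 1);
for i = 1:nrows
    output_data(i:nrows:end) = data(i, :);
end
isequal(output_data, data(:))   % returns logical 1 (true)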
Let's start with some test data:
rowsize = 1e7;
nrows = 10;
data = randi([0, 32767], nrows, rowsize, 'int16');
What do you want to do now? What is the relation of the shown code and the question about writing data at specific positions into a file?
By the way, the 'b' format has not existed in fopen for over 20 years now. Simply use 'W'.
NeuronDB
NeuronDB on 25 Mar 2022
Hi Jan,
The raw data is stored in separate files, one for each row. So I need to loop through the files to read each row, append the data in the workspace with cat(1, data, new_row), then do data(:), and then write to the binary file. But this requires storing the large array in the workspace before writing it to the data file. I would like to just read the first row, write it to the data file, then read the next row, and so on, to save memory and speed things up!
Thank you in advance!


Accepted Answer

Walter Roberson
Walter Roberson on 25 Mar 2022
First (and this is important!) write a block of zeros that is the same number of bytes as the final array size. The writing will not work properly if you omit this step. But you do not need to create an array of that size: you can loop, writing out a buffer of zeros until enough bytes have accumulated. Do not write extra data: there is no way in MATLAB to get rid of the extra data once it is written.
Now, repeat:
fseek to ((row number minus 1) times (bytes per element)) from the beginning of the file.
fwrite() the content of the row, making sure to use the precision argument to control how the data is written, and making sure to use the "skip" option. The value of the skip should be ((total rows minus 1) times (bytes per element))
Move on to the next row and repeat.
This will not be fast at all. Every page that is being updated will have to be read by MATLAB, and MATLAB will have to do the modification in its internal buffers and write the results out again.
It is not possible at the MATLAB level to "leave holes" that you gradually fill in. And even if it were, MATLAB would still need to do the continual read/modify/write cycle.
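A minimal sketch of these steps (added here as an illustration, not code from the answer itself), assuming int16 data, one input file per row with hypothetical names row01.bin, row02.bin, ..., and sizes small enough that the zero block can be written in one call:
nRows = 3;    % number of row files (assumed for the sketch)
nCols = 5;    % elements per row (assumed for the sketch)
width = 2;    % bytes per int16 element
% 1) Pre-fill the output file with zeros of the final size.
%    (For a huge file, write a smaller zero buffer in a loop instead.)
fp = fopen('interleaved.bin', 'W');
fwrite(fp, zeros(1, nRows * nCols, 'int16'), 'int16');
% 2) For each row: seek to its first slot, then write with a skip so that
%    consecutive elements of the row land nRows slots apart.
skip = (nRows - 1) * width;   % bytes skipped before each written element
for r = 1:nRows
    rfid = fopen(sprintf('row%02d.bin', r), 'r');   % hypothetical per-row file
    row  = fread(rfid, Inf, '*int16');
    fclose(rfid);
    fseek(fp, (r - 1) * width, 'bof');      % first slot of row r
    fwrite(fp, row(1), 'int16');            % first element: no skip needed
    fwrite(fp, row(2:end), 'int16', skip);  % remaining elements, skipping the other rows' slots
end
fclose(fp);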
  2 Comments
Jan
Jan on 26 Mar 2022
Edited: Jan on 26 Mar 2022
I've written equivalent code. There was no problem when I omitted the initial step of writing zeros. Iteratively expanding the file is not expensive either, because the existing data are not rewritten. In my tests, pre-allocating the file was even slower.
To my surprise, there is no method to crop a file in MATLAB, as you say. See FileExchange: FileResize.
Walter Roberson
Walter Roberson on 26 Mar 2022
In MATLAB, fseek beyond the end of a file does not work, at least historically.
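A quick way to check this behavior on a given release (a small illustrative test with a hypothetical file name; fseek returns 0 on success and -1 on failure):
% Write a tiny file, then try to seek far past its end and inspect the status.
fid = fopen('seektest.bin', 'w');
fwrite(fid, zeros(1, 10, 'int16'), 'int16');      % 20-byte file
status = fseek(fid, 1000, 'bof');                 % attempt to seek well past EOF
fprintf('fseek past EOF returned %d\n', status);  % -1 would mean the seek failed
fclose(fid);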


More Answers (1)

Jan
Jan on 26 Mar 2022
Edited: Jan on 26 Mar 2022
% Some test data storing the rows in different files:
nRow = 10;
nCol = 1e6;
for k = 1:nRow
    [fid, msg] = fopen(sprintf('file%02d.bin', k), 'W');
    assert(fid > 0, msg);
    data = randi([0, 32767], nCol, 1, 'int16');
    fwrite(fid, data, 'int16');
    fclose(fid);
end
% *** Version 1: insert data in chunks into the file:
tic
% Create the output file:
[ofid, msg] = fopen('matrix1.bin', 'W');
assert(ofid > 0, msg);
% Pre-allocate the output file (not really needed):
width = 2;  % Bytes per element
skip = (nRow - 1) * width;
fwrite(ofid, 0, 'int16', (nRow * nCol - 1) * width);
% Loop over input files:
for k = 1:nRow
    [ifid, msg] = fopen(sprintf('file%02d.bin', k), 'r');
    assert(ifid > 0, msg);
    data = fread(ifid, Inf, '*int16');
    fclose(ifid);
    % Insert in output file in chunks:
    fseek(ofid, (k-1) * width, 'bof');
    fwrite(ofid, data(1), 'int16');
    fseek(ofid, k * width, 'bof');
    fwrite(ofid, data(2:nCol), 'int16', skip);
end
fclose(ofid);
toc;
% *** Version 2: join the array in memory:
tic
% Loop over input files:
data = zeros(nRow, nCol, 'int16');
for k = 1:nRow
    [ifid, msg] = fopen(sprintf('file%02d.bin', k), 'r');
    assert(ifid > 0, msg);
    data(k, :) = fread(ifid, Inf, '*int16');
    fclose(ifid);
end
% Write the output file at once:
[ofid, msg] = fopen('matrix2.bin', 'W');
assert(ofid > 0, msg);
fwrite(ofid, data, 'int16');
fclose(ofid);
toc;
Timings on my i5, Matlab R2018b, SSD:
Elapsed time is 46.099363 seconds. % Insert on disk
Elapsed time is 0.060289 seconds. % Insert in memory
This means that joining the data in RAM is much faster than writing the data with skips.
This might be different if you convert the imported data to double, which uses 8 bytes per element instead of 2 bytes for int16. Then the available RAM might be exhausted and the computer would store the data in the much slower virtual memory.
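For a rough sense of scale with the sizes from the question (a back-of-the-envelope estimate added here, not part of the original answer):
nElem = 100 * 1e10;                               % total elements, as stated in the question
fprintf('int16 : %.1f TB\n', nElem * 2 / 1e12);   % about 2 TB
fprintf('double: %.1f TB\n', nElem * 8 / 1e12);   % about 8 TB
At that scale the full array will not fit in the RAM of a typical machine, so the data would have to be processed in chunks either way.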
