MATLAB slows down when writing to a large file

Alex on 8 May 2012
Dear all, I am using MATLAB to control an external device, extract measurements from it, and append the results to a file.
Using tic/toc I have found that MATLAB's performance degrades as the file size increases.
The device returns 32,000 double-precision measurements every second.
To reduce the written size, the doubles are converted to single before writing. After 3-4 days of non-stop measurements the file is around 22 GB, and each write takes more and more time.
I am using MATLAB R2010b on a 32-bit system. Surprisingly, the CPU and RAM usage look constant and quite low (the CPU is constantly at 24% and there are always 2+ GB of RAM free).
The file is opened once in append mode, an fwrite is issued every second, and when the process ends after a few days a single fclose is executed.
  3 Comments
Jan on 8 May 2012
While Jason's explanations look very reasonable, I can still imagine that there might be further reasons for the speed reduction. Without seeing the code and the profiler log data it is impossible to check this.
Jason Ross on 8 May 2012
Thanks, and I completely concur, Jan.


Answers (4)

Jason Ross on 8 May 2012
You are waiting on disk I/O in this case. I'm not surprised that performance drops off when the file reaches 22 GB. That's a pretty large file!
If you want better performance, I have two suggestions:
  1. Refactor your code to create smaller files (think one file per hour) and do the aggregation later, or process the data in a way that lets you pull in a collection of files (or a date range); a rough sketch follows this list. This has a number of side benefits. One is that you won't suffer the performance degradation, since you'll end up with a collection of files of a more manageable size. Lots of things get easier, like prototyping on an hour's worth of data rather than trying to pull in a 22 GB file, which may not even be possible on a 32-bit OS if you try to pull it all into memory. I assume you are doing some sort of scanning of the file? The other, more important, benefit is that you won't lose 3-4 days of data when something goes wrong on day three -- like a power failure. You'll also be able to scale your experiment as far as you have disk space.
  2. Purchase a better disk controller (SATA 6 Gb/s) and an SSD. This might help for a while, but the first suggestion will help you forever.
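A minimal sketch of the one-file-per-hour idea (the file name pattern and the acquire()/acquiring placeholders are made up for illustration, not part of the original code):
t0 = tic;  hour_idx = 1;
fid = fopen( sprintf('meas_%04d.bin', hour_idx), 'w' );
while acquiring                               % your existing acquisition loop
    data = acquire();                         % placeholder for the device read
    fwrite( fid, single(data), 'single' );
    if toc(t0) >= hour_idx*3600               % roll over to a new file every hour
        fclose( fid );
        hour_idx = hour_idx + 1;
        fid = fopen( sprintf('meas_%04d.bin', hour_idx), 'w' );
    end
end
fclose( fid );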
  3 Comments
Walter Roberson on 8 May 2012
The problem there is in the "pointer": pointers are 32-bit on 32-bit operating systems, so at most 4 GB can be addressed directly. Anything beyond 4 GB on a 32-bit OS has to use simulated pointers.
Which OS are you using? A 32-bit MS Windows system, or a 32-bit Linux system? (If my memory serves me correctly, R2010b was not supported on the versions of Mac OS X that were 32-bit.)
Jason Ross on 8 May 2012
Sure, it might. But keep in mind that when you add a new entry, the OS likely performs other bookkeeping to ensure the data actually got written. As the file size increases, those checks take longer.
But doing pretty much anything with a 22 GB file is going to be a pain and/or impossible, especially on a 32-bit OS. Say, for example, the data doesn't come in correctly as you scan it and you want to examine the file. Opening a 22 GB file in a text editor on Windows may or may not work (unless you install something like the UNIX tool "more" or some other utility that can display small pieces of a large file). Moving it becomes a pain. And trying an iterative approach to anything puts your iteration time at 3-4 days.
In contrast, if you have files that each hold an hour's worth of data (rough estimate -- 200 MB each), you can open them with "normal" tools and examine them. If you find a bug in your code, you can debug far more easily and know within an hour that you got rid of it. And when the power goes out, you get only one corrupt file and keep all the remaining data, which is likely not as big a deal as losing 3-4 days of work.
Also, this approach lets you extend your experiment for as long as you have disk space to keep the data. With the approach you have now, a monolithic file of 40 or 60 GB just gets more and more painful to deal with.



per isakson on 9 May 2012
Writing to an HDF5 file is one possibility.
With the attached function, huge2hdf, I have created one 21.6 GB HDF5 file. Each of 171875 vectors, <32000x1 single>, is written to the file as a separate item, i.e. its own dataset (see the code).
My system: R2012a, Windows 7 64-bit, a three-year-old Dell OptiPlex 960, 4-core CPU, 8 GB RAM, unknown hard disk. The CPU usage varied between 20% and 45% according to the Task Manager. (I have no idea whether this would work on a system with R2010b and 32 bit.)
The total execution time was one hour, i.e. on average 6 MB per second. (Approximately five percent of the time was spent creating the random data.) The time to write increased slowly with the size of the file: it started at 1.7 seconds per 100 <32000x1 single> vectors and increased to 2.8 seconds per 100 vectors at the end (with a few outliers).
Reading data from the HDF5 file caused no problems:
>> tic, z=h5read( FileSpec, '/data011302' ); toc
Elapsed time is 0.004931 seconds.
>> tic, z=h5read( FileSpec, '/data000312' ); toc
Elapsed time is 0.004562 seconds.
and the expected data was returned!
One more experiment, 2012-05-10
I modified the experiment. Instead of writing to 171875 datasets of <32000x1 single>, I wrote to one dataset of <32000x171875 single>, one column at a time, with the function huge2hdf_lenxinf.
The total execution time was less than half an hour. Thus, the speed was more than twice as high as in the first experiment. That makes about 100 <32000x1 single> vectors per second, including opening and closing the single file (under the hood) in every call.
One more variant: one dataset of <32000*171875x1 single>, appending one <32000x1 single> block at a time with the function huge2hdf_infx1. The total execution time was again less than half an hour, close to that of the previous experiment.
--- attachment ---
function timing = huge2hdf( N )
% Write N (or, if N is inf, roughly 22 GB worth of) random <32000x1 single>
% vectors, each to its own dataset in one HDF5 file, and record the elapsed
% time every 100 vectors.
    FileSpec = 'c:\temp\huge2hdf5.hdf';
    if exist( FileSpec, 'file' )
        delete( FileSpec )
    end
    len_time = 32*1e3;                            % samples per vector
    if isinf( N )
        n_qty = round( 22*1e9/(len_time*4) );     % enough vectors for ~22 GB
    else
        n_qty = N;
    end
    timing = nan( 2, ceil( n_qty / 100 ) );
    jj = 0;
    ticID = tic;
    for ii = 1 : n_qty
        if rem( ii, 100 ) == 0
            jj = jj+1;
            timing(:,jj) = [ ii; toc( ticID ) ];  % [vector index; elapsed seconds]
        end
        qty = sprintf( 'data%06u', ii );
        val = single(ii) + rand( len_time, 1, 'single' );
        h5create( FileSpec ...
            , ['/',qty] ...
            , [ len_time, 1 ] ...
            , 'Datatype', 'single' ...
            )
        h5write( FileSpec, ['/',qty], val )
    end
end
function timing = huge2hdf_lenxinf( N )
% Same experiment, but with a single extendable <32000 x inf> dataset:
% each vector is written as one new column.
    FileSpec = 'c:\temp\huge2hdf5_lenxinf.hdf';
    if exist( FileSpec, 'file' )
        delete( FileSpec )
    end
    len_time = 32*1e3;
    if isinf( N )
        n_qty = round( 22*1e9/(len_time*4) );
    else
        n_qty = N;
    end
    timing = nan( 2, ceil( n_qty / 100 ) );
    jj = 0;
    ticID = tic;
    qty = 'data_lenxinf';
    h5create( FileSpec ...
        , ['/',qty] ...
        , [ len_time, inf ] ...                   % unlimited number of columns
        , 'Datatype', 'single' ...
        , 'ChunkSize', [len_time,1] ...           % one chunk per vector
        )
    for ii = 1 : n_qty
        if rem( ii, 100 ) == 0
            jj = jj+1;
            timing(:,jj) = [ ii; toc( ticID ) ];
        end
        val = single(ii) + rand( len_time, 1, 'single' );
        h5write( FileSpec, ['/',qty], val, [ 1, ii ], [ len_time, 1 ] )   % column ii
    end
end
function timing = huge2hdf_infx1( N )
% Variant: a single extendable <inf x 1> dataset, appending len_time rows per call.
    FileSpec = 'c:\temp\huge2hdf5_infx1.hdf';
    <snip>
    h5create( FileSpec ...
        , ['/',qty] ...
        , [ inf, 1 ] ...                          % unlimited number of rows
        , 'Datatype', 'single' ...
        , 'ChunkSize', [len_time,1] ...
        )
    <snip>
    h5write( FileSpec, ['/',qty], val, [ 1+(ii-1)*len_time, 1 ], [ len_time, 1 ] )
end
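Reading a block of columns back from the single extendable dataset of huge2hdf_lenxinf would look something like this, using the start/count form of h5read (the indices here are just an example):
% read 100 one-second vectors (columns 1000..1099) from the lenxinf dataset
z = h5read( 'c:\temp\huge2hdf5_lenxinf.hdf', '/data_lenxinf', [1, 1000], [32000, 100] );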

Daniel Shub on 9 May 2012
I am going to guess that a 64-bit OS is not going to fix your problems. Opening and closing files takes time (not a lot, but it is measurable). If your system is stressed so much that it cannot open an additional file every hour, my guess is that your program will crash well before 4 days have elapsed. You could also open all the files (one for each hour) before starting the experiment and close them all after it. This is not as robust as opening and closing them when needed, but it is no more risky than what you are doing now.
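A rough sketch of that idea (the file count, the names, and the acquire()/acquiring placeholders are made up for illustration):
n_files = 96;                                    % e.g. 4 days of hourly files
fids = zeros( 1, n_files );
for h = 1:n_files                                % open everything before the run
    fids(h) = fopen( sprintf('meas_hour_%03d.bin', h), 'w' );
end
t0 = tic;
while acquiring                                  % during the experiment
    data = acquire();                            % placeholder for the device read
    h = min( n_files, 1 + floor(toc(t0)/3600) ); % pick the file for the current hour
    fwrite( fids(h), single(data), 'single' );
end
fclose( 'all' );                                 % close everything afterwards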
  1 Comment
Walter Roberson on 9 May 2012
Caution: you might not be able to open more than 16 files (!!), but that should only happen on very old operating systems. A limit of 250 files per process can be a serious constraint on MS Windows.



Alex on 9 May 2012
(I am sorry for commenting instead of replying before.)
Thanks, I will try. a. Do you think that 64-bit Windows might alleviate the problem? b. As I said, my equipment needs approximately 1 second to return the 32,000 numbers. Do you think that closing one file and then opening and writing to a new one might add a lot of time compared to a single append?
Right now the code looks roughly like this:
fid = fopen( filename, 'a' );              % opened once, in append mode
for k = 1:n_writes                         % hundreds of thousands of iterations
    fwrite( fid, single(data), 'single' ); % one batch of measurements per second
end
fclose( fid );
c. I was also wondering whether there is an option in MATLAB to automatically split the output into new files (following some file name pattern) so that I do not need to change my code drastically.
Regards, Alex
  2 Comments
Walter Roberson on 9 May 2012
There is no option in MATLAB to automatically split output into new files when the file being written "gets too large".
There are programs available that can split binary files (e.g., Unix "split") or text files (e.g., Unix "csplit"), but if you have a single file open for writing in MS Windows, Windows will consider it "locked" and will not allow other programs to write to the file (to remove the part that has already been split off into another file). One has to write a program carefully if the size of the file might change "underneath" you, and MATLAB does not offer one of the key file I/O operations needed for that (the ability to flush a buffer).
To avoid having to change your code much, you could write a small routine such as
function fid = nextfile(fid)
% Returns a valid file id, rolling over to a new file once the current one is large.
% Call with a negative fid to open the first file.
persistent file_count
if isempty(file_count), file_count = 0; end
need_new = fid < 0;
if ~need_new
    P = ftell(fid);
    need_new = P >= 2^24;            % arbitrary size threshold, in bytes
end
if need_new
    if fid >= 0, fclose(fid); end    % close the old file only when rolling over
    file_count = file_count + 1;
    newfile = sprintf('MyOutputFile_%d.dat', file_count);
    fid = fopen(newfile, 'w');
end
end
To start the process off, call the routine with a negative argument.
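Used inside the acquisition loop, it might look like this (acquire() and acquiring are placeholders for the device call and the loop condition):
fid = nextfile(-1);                        % negative argument opens the first file
while acquiring
    data = acquire();                      % placeholder for the device read
    fwrite( fid, single(data), 'single' );
    fid = nextfile(fid);                   % rolls over once the file passes the size limit
end
if fid >= 0, fclose(fid); end              % close the last file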
Jason Ross on 9 May 2012
To answer your "will a 64-bit operating system help": no, I don't think so. Reasons why, relating to your problem:
- At present you have observed that you are not getting "out of memory" errors. If you were, the larger address space of a 64-bit OS would be beneficial, for instance if you wanted to just open the whole file ... but that leads to ...
- A 64-bit OS still has to spend time figuring out how to commit the data to disk, and your file is still simply too large to deal with efficiently. Although you would be able to pull it all into memory, reading a 22 GB file off disk is going to take (theoretically, if your computer did nothing but I/O on a 6 Gb/s disk controller) about 30 seconds. Realistically it will take considerably longer, since your computer is doing other things that involve disk I/O and the processor all the time, on top of whatever processing and display of the file is needed. It wouldn't surprise me if you mistook the load time for a hung process.
This model of not having one big file is fairly common. For example, logs on UNIX machines are rotated on a regular basis so that they stay a reasonable size but still contain the information, and old logs are discarded automatically.
I suggested hourly only as an example -- it might make more sense for your application to start a new file based on some other metric that makes your downstream data processing more convenient: number of samples taken, sunrise/sunset, etc.
