MATLAB slows down when writing to a large file

Alex on 8 May 2012
Dear all, I am using MATLAB to control an external device, extract measurements from it, and append the results to a file.
Using tic/toc I have found that MATLAB's performance degrades as the file size increases.
The device returns 32,000 double-precision measurements every second.
To reduce the written size, the doubles are converted to single before writing. After 3-4 days of non-stop measurements the file is around 22 GB, and each write takes more and more time.
I am using MATLAB R2010b on a 32-bit system. Surprisingly, the CPU and RAM usage look constant and quite low (the CPU is constantly at 24% and there are always 2+ GB of RAM free).
The file is opened once in append mode, an fwrite is issued every second, and when the process ends after a few days a single fclose is executed.
  3 Comments
Jan on 8 May 2012
While Jason's explanations look very reasonable, I can still imagine that there might be further reasons for the speed reduction. Without seeing the code and the profiler log data it is impossible to check this.
Jason Ross on 8 May 2012
Thanks, and I completely concur, Jan.


Answers (4)

Jason Ross on 8 May 2012
You are waiting on disk I/O in this case. I'm not surprised that performance drops off when the file reaches 22 GB. That's a pretty large file!
If you want better performance, I have two suggestions:
  1. Refactor your code to create smaller files (think one file per hour) and do the aggregation later, or process the data in a way that lets you pull in a collection of files (or a date range); a rough sketch follows this list. This has a number of side benefits. One is that you won't suffer the performance degradation, since you'll end up with a collection of files of a more manageable size. Lots of things get easier, like prototyping on an hour's worth of data rather than trying to pull in a 22 GB file, which may not even be possible on a 32-bit OS if you try to pull it all into memory. I assume you are doing some sort of scanning of the file? The other, more important, benefit is that you won't lose 3-4 days of data when something goes wrong on day three -- like a power failure. You'll also be able to scale your experiment as far as you have disk space.
  2. Purchase a better disk controller (SATA 6 Gb/s) and an SSD. This might help for a while, but the first suggestion will help you forever.
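A minimal sketch of the one-file-per-hour idea (the file name pattern and the acquire()/acquiring placeholders are made up for illustration, not part of the original code):
t0 = tic;  hour_idx = 1;
fid = fopen( sprintf('meas_%04d.bin', hour_idx), 'w' );
while acquiring                               % your existing acquisition loop
    data = acquire();                         % placeholder for the device read
    fwrite( fid, single(data), 'single' );
    if toc(t0) >= hour_idx*3600               % roll over to a new file every hour
        fclose( fid );
        hour_idx = hour_idx + 1;
        fid = fopen( sprintf('meas_%04d.bin', hour_idx), 'w' );
    end
end
fclose( fid );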
  3 Comments
Walter Roberson on 8 May 2012
The problem there is in the "pointer": pointers are 32-bit on 32-bit operating systems, so at most 4 GB can be addressed directly. Anything beyond 4 GB on a 32-bit OS has to use simulated pointers.
Which OS are you using? A 32-bit MS Windows system, or a 32-bit Linux system? (If my memory serves me correctly, R2010b was not supported on the versions of Mac OS X that were 32-bit.)
Jason Ross on 8 May 2012
Sure, it might. But keep in mind that when you add a new entry, the OS likely performs other bookkeeping to ensure the data actually got written. As the file size increases, those checks take longer.
But doing pretty much anything with a 22 GB file is going to be a pain and/or impossible, especially on a 32-bit OS. Say, for example, the data doesn't come in correctly as you scan it and you want to examine the file. Opening a 22 GB file in a text editor on Windows may or may not work (unless you install something like the UNIX tool "more" or some other utility that can display small pieces of a large file). Moving it becomes a pain. And trying an iterative approach to anything puts your iteration time at 3-4 days.
In contrast, if you have files that each hold an hour's worth of data (rough estimate -- 200 MB each), you can open them with "normal" tools and examine them. If you find a bug in your code, you can debug far more easily and know within an hour that you got rid of it. And when the power goes out, you get only one corrupt file and keep all the remaining data, which is likely not as big a deal as losing 3-4 days of work.
Also, this approach lets you extend your experiment for as long as you have disk space to keep the data. With the approach you have now, a monolithic file of 40 or 60 GB just gets more and more painful to deal with.



per isakson on 9 May 2012
Writing to an HDF5 file is one possibility.
With the attached function, huge2hdf, I have created one 21.6 GB HDF5 file. Each of 171875 vectors, <32000x1 single>, is written to the file as a separate item, i.e. its own dataset (see the code).
My system: R2012a, Windows 7 64-bit, a three-year-old Dell OptiPlex 960, 4-core CPU, 8 GB RAM, unknown hard disk. The CPU usage varied between 20% and 45% according to the Task Manager. (I have no idea whether this would work on a system with R2010b and 32 bit.)
The total execution time was one hour, i.e. on average 6 MB per second. (Approximately five percent of the time was spent creating the random data.) The time to write increased slowly with the size of the file: it started at 1.7 seconds per 100 <32000x1 single> vectors and increased to 2.8 seconds per 100 vectors at the end (with a few outliers).
Reading data from the HDF5 file caused no problems:
>> tic, z=h5read( FileSpec, '/data011302' ); toc
Elapsed time is 0.004931 seconds.
>> tic, z=h5read( FileSpec, '/data000312' ); toc
Elapsed time is 0.004562 seconds.
and the expected data was returned!
One more experiment, 2012-05-10
I modified the experiment. Instead of writing to 171875 datasets of <32000x1 single>, I wrote to one dataset of <32000x171875 single>, one column at a time, with the function huge2hdf_lenxinf.
The total execution time was less than half an hour. Thus, the speed was more than twice as high as in the first experiment. That makes about 100 <32000x1 single> vectors per second, including opening and closing the single file (under the hood) in every call.
One more variant: one dataset of <32000*171875x1 single>, appending one <32000x1 single> block at a time with the function huge2hdf_infx1. The total execution time was again less than half an hour, close to that of the previous experiment.
--- attachment ---
function timing = huge2hdf( N )
% Write N (or, if N is inf, roughly 22 GB worth of) random <32000x1 single>
% vectors, each to its own dataset in one HDF5 file, and record the elapsed
% time every 100 vectors.
    FileSpec = 'c:\temp\huge2hdf5.hdf';
    if exist( FileSpec, 'file' )
        delete( FileSpec )
    end
    len_time = 32*1e3;                            % samples per vector
    if isinf( N )
        n_qty = round( 22*1e9/(len_time*4) );     % enough vectors for ~22 GB
    else
        n_qty = N;
    end
    timing = nan( 2, ceil( n_qty / 100 ) );
    jj = 0;
    ticID = tic;
    for ii = 1 : n_qty
        if rem( ii, 100 ) == 0
            jj = jj+1;
            timing(:,jj) = [ ii; toc( ticID ) ];  % [vector index; elapsed seconds]
        end
        qty = sprintf( 'data%06u', ii );
        val = single(ii) + rand( len_time, 1, 'single' );
        h5create( FileSpec ...
            , ['/',qty] ...
            , [ len_time, 1 ] ...
            , 'Datatype', 'single' ...
            )
        h5write( FileSpec, ['/',qty], val )
    end
end
function timing = huge2hdf_lenxinf( N )
% Same experiment, but with a single extendable <32000 x inf> dataset:
% each vector is written as one new column.
    FileSpec = 'c:\temp\huge2hdf5_lenxinf.hdf';
    if exist( FileSpec, 'file' )
        delete( FileSpec )
    end
    len_time = 32*1e3;
    if isinf( N )
        n_qty = round( 22*1e9/(len_time*4) );
    else
        n_qty = N;
    end
    timing = nan( 2, ceil( n_qty / 100 ) );
    jj = 0;
    ticID = tic;
    qty = 'data_lenxinf';
    h5create( FileSpec ...
        , ['/',qty] ...
        , [ len_time, inf ] ...                   % unlimited number of columns
        , 'Datatype', 'single' ...
        , 'ChunkSize', [len_time,1] ...           % one chunk per vector
        )
    for ii = 1 : n_qty
        if rem( ii, 100 ) == 0
            jj = jj+1;
            timing(:,jj) = [ ii; toc( ticID ) ];
        end
        val = single(ii) + rand( len_time, 1, 'single' );
        h5write( FileSpec, ['/',qty], val, [ 1, ii ], [ len_time, 1 ] )   % column ii
    end
end
function timing = huge2hdf_infx1( N )
% Variant: a single extendable <inf x 1> dataset, appending len_time rows per call.
    FileSpec = 'c:\temp\huge2hdf5_infx1.hdf';
    <snip>
    h5create( FileSpec ...
        , ['/',qty] ...
        , [ inf, 1 ] ...                          % unlimited number of rows
        , 'Datatype', 'single' ...
        , 'ChunkSize', [len_time,1] ...
        )
    <snip>
    h5write( FileSpec, ['/',qty], val, [ 1+(ii-1)*len_time, 1 ], [ len_time, 1 ] )
end
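Reading a block of columns back from the single extendable dataset of huge2hdf_lenxinf would look something like this, using the start/count form of h5read (the indices here are just an example):
% read 100 one-second vectors (columns 1000..1099) from the lenxinf dataset
z = h5read( 'c:\temp\huge2hdf5_lenxinf.hdf', '/data_lenxinf', [1, 1000], [32000, 100] );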

Daniel Shub on 9 May 2012
I am going to guess that a 64-bit OS is not going to fix your problems. Opening and closing files takes time (not a lot, but it is measurable). If your system is stressed so much that it cannot open an additional file every hour, my guess is that your program will crash well before 4 days have elapsed. You could also open all the files (one for each hour) before starting the experiment and close them all after it. This is not as robust as opening and closing them when needed, but it is no more risky than what you are doing now.
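A rough sketch of that idea (the file count, the names, and the acquire()/acquiring placeholders are made up for illustration):
n_files = 96;                                    % e.g. 4 days of hourly files
fids = zeros( 1, n_files );
for h = 1:n_files                                % open everything before the run
    fids(h) = fopen( sprintf('meas_hour_%03d.bin', h), 'w' );
end
t0 = tic;
while acquiring                                  % during the experiment
    data = acquire();                            % placeholder for the device read
    h = min( n_files, 1 + floor(toc(t0)/3600) ); % pick the file for the current hour
    fwrite( fids(h), single(data), 'single' );
end
fclose( 'all' );                                 % close everything afterwards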
  1 Comment
Walter Roberson on 9 May 2012
Caution: you might not be able to open more than 16 files (!!), but that should only happen on very old operating systems. A limit of 250 files per process can be a serious constraint on MS Windows.



Alex on 9 May 2012
(I am sorry for commenting instead of replying before.)
Thanks, I will try. a. Do you think that 64-bit Windows might alleviate the problem? b. As I said, my equipment needs approximately 1 second to return the 32,000 numbers. Do you think that closing one file and then opening and writing to a new one might add a lot of time compared to a single append?
Right now the code looks roughly like this:
fid = fopen( filename, 'a' );              % opened once, in append mode
for k = 1:n_writes                         % hundreds of thousands of iterations
    fwrite( fid, single(data), 'single' ); % one batch of measurements per second
end
fclose( fid );
c. I was also wondering whether there is an option in MATLAB to automatically split the output into new files (following some file name pattern) so that I do not need to change my code drastically.
Regards, Alex
  2 Comments
Walter Roberson on 9 May 2012
There is no option in MATLAB to automatically split output into new files when the file being written "gets too large".
There are programs available that can split binary files (e.g., Unix "split") or text files (e.g., Unix "csplit"), but if you have a single file open for writing in MS Windows, Windows will consider it "locked" and will not allow other programs to write to the file (to remove the part that has already been split off into another file). One has to write a program carefully if the size of the file might change "underneath" you, and MATLAB does not offer one of the key file I/O operations needed for that (the ability to flush a buffer).
To avoid having to change your code much, you could write a small routine such as
function fid = nextfile(fid)
% Returns a valid file id, rolling over to a new file once the current one is large.
% Call with a negative fid to open the first file.
persistent file_count
if isempty(file_count), file_count = 0; end
need_new = fid < 0;
if ~need_new
    P = ftell(fid);
    need_new = P >= 2^24;            % arbitrary size threshold, in bytes
end
if need_new
    if fid >= 0, fclose(fid); end    % close the old file only when rolling over
    file_count = file_count + 1;
    newfile = sprintf('MyOutputFile_%d.dat', file_count);
    fid = fopen(newfile, 'w');
end
end
To start the process off, call the routine with a negative argument.
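Used inside the acquisition loop, it might look like this (acquire() and acquiring are placeholders for the device call and the loop condition):
fid = nextfile(-1);                        % negative argument opens the first file
while acquiring
    data = acquire();                      % placeholder for the device read
    fwrite( fid, single(data), 'single' );
    fid = nextfile(fid);                   % rolls over once the file passes the size limit
end
if fid >= 0, fclose(fid); end              % close the last file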
Jason Ross on 9 May 2012
To answer your "will a 64-bit operating system help": no, I don't think so. Reasons why, relating to your problem:
- At present you have observed that you are not getting "out of memory" errors. If you were, the larger address space of a 64-bit OS would be beneficial, for instance if you wanted to just open the whole file ... but that leads to ...
- A 64-bit OS still has to spend time figuring out how to commit the data to disk, and your file is still simply too large to deal with efficiently. Although you would be able to pull it all into memory, reading a 22 GB file off disk is going to take (theoretically, if your computer did nothing but I/O on a 6 Gb/s disk controller) about 30 seconds. Realistically it will take considerably longer, since your computer is doing other things that involve disk I/O and the processor all the time, on top of whatever processing and display of the file is needed. It wouldn't surprise me if you mistook the load time for a hung process.
This model of not having one big file is fairly common. For example, logs on UNIX machines are rotated on a regular basis so that they stay a reasonable size but still contain the information, and old logs are discarded automatically.
I suggested hourly only as an example -- it might make more sense for your application to start a new file based on some other metric that makes your downstream data processing more convenient: number of samples taken, sunrise/sunset, etc.
