Memory usage and blockproc

Hi,
Consider this command:
blockproc('inputImage.tif', blockSize, function_handle, ...
    'UseParallel', true, ...
    'Destination', BigTiff_Adapter);
As the call suggests, the input image is a huge TIFF (44 GB+), so there is no way I can load it into memory on a machine that has 16 GB of memory in total.
The output of function_handle has the same number of rows and columns as the input, but only two bands of single-precision values: each output pixel is a computed value based on the input image. As I mentioned, the input is a 44 GB+ 4-band TIFF where each band is uint8 (4 bytes per pixel). The output is therefore twice that size, i.e. 88 GB+ (2 bands x 32 bits per pixel per band = 8 bytes per pixel). So it is also not possible to hold the output as one matrix on a machine with 16 GB of memory.
Since these are geolocated data, I need to store the output as TIFF, and since the image is so big I definitely need to write it as BigTIFF; hence I am using a customized BigTiff_Adapter. blockSize is set to the tile size (my input TIFF is tiled, 256x256 pixels per tile). So pretty much one tile is loaded, processed via function_handle, and then written to the output BigTIFF, tile by tile.
So, here is the part I don't get. When I set UseParallel to false, everything works just fine, except that it takes a long time. You would think that using parallel processing should improve that. However, when I set UseParallel to true, all my memory is used up and the computation is even slower than with UseParallel set to false. Literally, the serial version seems to be much faster.
If you are thinking of communication cost, don't bother. The computation in function_handle is so simple that it only needs the data within one pixel, so there is no communication at all. Say the output pixel is o(i,j) = K*reshape(I(i,j,:),[],1), where o(i,j) is the pixel on the i-th row and j-th column of the output, K is a 1x3 matrix, and reshape(I(i,j,:),[],1) is a 3x1 column vector. As you can see, I don't need any information from neighboring blocks or even neighboring pixels, so there is absolutely no inter-block communication (it seems a heaven for parallel processing). All this said, it is not communication inside function_handle that slows down the parallel version (I don't know what MATLAB does under the hood).
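As a sketch, a per-block function of the kind described (a linear combination across bands, vectorized over the whole block) might look like this. The names blockFun and the 2x4 matrix K are illustrative only, not the actual code; the shapes assume a 4-band uint8 input block and a 2-band single output, as described above:

```matlab
% Illustrative sketch: apply a 2x4 mixing matrix K to every pixel of a
% block, producing a 2-band single-precision output block.
K = single([0.1 0.2 0.3 0.4;    % weights for output band 1
            0.4 0.3 0.2 0.1]);  % weights for output band 2

% bs.data is r-by-c-by-4 uint8; result is r-by-c-by-2 single.
blockFun = @(bs) reshape( ...
    (K * reshape(single(bs.data), [], size(bs.data,3)).').', ...
    size(bs.data,1), size(bs.data,2), size(K,1));
```

A handle like this could be passed to blockproc as function_handle; since each output pixel depends only on the same input pixel, blocks are fully independent.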
Any idea why turning on UseParallel increases the memory usage and causes slower calculation? On some machines I even get java.lang.OutOfMemoryError.
I should add that with an image of, say, 8 or 9 GB everything works just fine, with UseParallel either true or false. But when I track the code, memory usage climbs as if the entire image were being loaded. I am controlling the number of workers; it varies between 4 and 12. If each block is assigned to one worker, then at most 12 blocks of 256x256x4 uint8 are loaded at any time, i.e. about 3 MB; the output should be twice that, about 6 MB; and there are not that many intermediate variables during computation, but say they take another 24 MB. So practically I shouldn't see more than about 40 MB of memory usage, but I see at least 5-6 GB.
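The back-of-the-envelope arithmetic above can be written out explicitly (sizes are per the description; all values approximate):

```matlab
% Expected transient memory with 12 workers, one 256x256 tile each
nWorkers = 12;
inBytes  = 256*256*4*1;            % 4 uint8 bands  -> 262,144 B per tile
outBytes = 256*256*2*4;            % 2 single bands -> 524,288 B per tile
inputMB  = nWorkers*inBytes /2^20; % ~3 MB for all in-flight input tiles
outputMB = nWorkers*outBytes/2^20; % ~6 MB for all in-flight output tiles
```

Even adding generous headroom for intermediates, this is tens of megabytes, orders of magnitude below the observed gigabytes.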
So, what's going on?

 Accepted Answer

Meshooo on 21 Aug 2014
That's too much to read. Can you make it shorter?

1 Comment

That is a workaround, but it doesn't really solve the problem. In the end we want the whole thing in one file, so if I split the image, process each part, and then recombine them into one image, I end up with more reading/writing to disk (and these disks are network-mounted, so imagine how much slower that would be). I thought blockproc pretty much did this for you, but as Ashish has explained, the problem is not with the reading part but the writing part, since writing in parallel is not supported.


More Answers (1)

Ashish Uthama on 21 Aug 2014
blockproc will not read the full image; it should only read one tile at a time. However, since your file system is serial, the final write has to be done in a serial fashion to your storage device. If you have a large number of workers, each processing a block in a short time, then the one process trying to serialize everything to the final output could be the main bottleneck. In this extreme case the problem is IO-bound rather than compute-bound, so parallel processing might actually hurt. I know this explanation might not really 'help'.
Is there more processing you need to do on this file? You might see a parallel advantage if you can combine all your future processing into one block processing function. That way you might even out the IO/Compute balance.

5 Comments

Thank you Ashish for your explanation. This actually helps.
So, if I understood properly: since TIFF does not support parallel writes, in the end only one process is writing all the blocks to disk, and that is presumably causing the bottleneck.
This might answer why setting UseParallel to false improves performance. But my question about memory usage remains: with UseParallel either true or false, memory usage seems higher than it should be.
With UseParallel=false, there should be only one block read, processed, and written to disk at a time. Since the blocks are about 256x256 pixels, that accounts for less than 1 MB of data, yet even with UseParallel=false memory consumption goes up by a couple of GBs.
Does blockproc, with UseParallel=false, finish writing one block to disk before reading the next block for processing? Or does it gather a bunch of them and every now and then write a whole batch of blocks to disk? If it does the second, that explains the memory usage; but if it does the first, I still don't get why memory usage goes up this much.
In the serial mode, one block is read/processed/written before moving to the next. In the parallel mode, workers can read/process blocks as fast as they can, and pipe the results to the single stitching process for writing. Now, the pipeline will allow for buffering (as it should), so if the stitching process is slow, the pipeline getting backed up is potentially what you are seeing as a memory hog. This is a guess at this point.
(You might have considered this, but just in case: when you set UseParallel to true, we actually use multiple 'workers'. These are separate MATLAB instances, so 12 workers implies 12 MATLAB processes running... and their associated memory footprint.)
Ah, I missed the part about memory bloat in serial mode. Here, the BigTIFF adapter is the first place I would check. Is your blockproc block size the same as the internal tiling of your TIFF file? If I remember right, the TIFF library has to load a full tile even if you request just a part of it, i.e. if your file is organized as one huge tile, then a subset read might not be efficient. I hope your internal TIFF tile size is the same as your blockproc size.
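One quick way to check the internal tiling is with MATLAB's Tiff class; a sketch ('inputImage.tif' stands in for the actual file):

```matlab
% Inspect how the TIFF is organized on disk (tiled vs. stripped)
t = Tiff('inputImage.tif', 'r');
if isTiled(t)
    % Compare this against the blockSize passed to blockproc
    tileSize = [getTag(t, 'TileLength'), getTag(t, 'TileWidth')]
else
    % Stripped files read whole strips at a time instead
    rowsPerStrip = getTag(t, 'RowsPerStrip')
end
close(t);
```

If tileSize matches blockSize (256x256 here), each block read maps to exactly one tile decode and no data is read twice.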
I read one tile at a time, so the entire tile is read, but the tiles are small, 256x256.
Based on your instruction I checked the memory during a serial run of blockproc, i.e. with UseParallel=false. Although overall system memory usage goes up, the MATLAB instances stay at about 70 MB. I tested this with 4 workers; all 4 workers were about 70 MB and the master was about 1 GB, fluctuating a lot (+-200 MB).
That makes sense to me. The overall system memory uptick might be the OS caching some parts of the file (read-ahead?). In the parallel case, the master worker is the 'stitcher', so the increase and fluctuations make sense: it's probably buffering completed tiles while waiting for the disk output to go through. (I hope you already use SSDs/a RAID array, or have the budget for it!)

