Why is batch() so slow?

Matthias on 16 Dec 2014
Commented: Edric Ellis on 17 Dec 2014
I'm trying to use batch() to load some data from a slow disk in the background, but it is extremely slow. See code example with timings below. I think it is slower than what can be explained by the overhead of communicating with the worker (consider that I am not even transferring the loaded data from the worker to the client in the example).
>> a = rand(512, 512, 1000);
>> save('a');
>> tic; load('a'); toc
Elapsed time is 5.574926 seconds.
>> tic; b = batch(@load, 1, {'a'}); toc; tic; wait(b); toc;
Elapsed time is 0.444297 seconds.
Elapsed time is 41.229590 seconds.
You can see that the time until the batch job is done is more than 35 s longer than the same operation on the client. This is not because a new Matlab worker has to be started -- in my example, a worker was already running (if no worker were running, the batch(...) command itself would take longer, not the wait(b)).
Where does this overhead come from? How can I avoid it? (I also tried parfeval, but parfeval is plagued by a memory leak that makes it unusable -- confirmed as a known bug by MathWorks).
Thanks, Matthias
  2 Comments
Matthias on 16 Dec 2014
Edited: Matthias on 16 Dec 2014
Even more bizarrely, if I right-click on the finished job in the Job Monitor and select Show Details, the displayed report indicates that the running duration of the job is 6 seconds. That's the same as the time it took on the client session. What happens in those 35 remaining seconds?
(I got this result on two different machines. Both running 2014b, however.)
Matthias on 16 Dec 2014
Some more data:
>> disp(datestr(now, 'HH:MM:SS:FFF')); ...
b = batch(@batchTest, 1); ...
disp(datestr(now, 'HH:MM:SS:FFF')); ...
wait(b); ...
disp(datestr(now, 'HH:MM:SS:FFF'));
21:18:35:124
21:18:35:934
21:19:17:319
>> diary(b)
--- Start Diary ---
21:18:40:762
21:18:46:237
--- End Diary ---
Function batchTest:
function a = batchTest
    disp(datestr(now, 'HH:MM:SS:FFF'));
    load('a');
    disp(datestr(now, 'HH:MM:SS:FFF'));
This shows that after the batch(...) command is executed, ~5 s pass before the worker starts executing batchTest(). The worker finishes batchTest() after another ~6 s, so it executes that function just as fast as the client does. Then more than 30 s pass before wait(...) returns.
What happens in this time? Maybe the initial 5 s have to do with setting up the environment on the worker. But the 30 s after the job is done?


Accepted Answer

Edric Ellis on 16 Dec 2014
Firstly, if you're using the local cluster type, then the batch command absolutely does need to launch the worker MATLAB process - it is not already running - you can verify this using Task Manager or similar. (Clusters of type MJS keep the workers running). The time for the batch command is simply the time needed to create the parallel.Job and parallel.Task objects needed for running the batch job, and saving those to disk.
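Besides looking at Task Manager, one way to check which cluster type the default profile points at is to inspect the class of the cluster object. This is a minimal sketch, assuming the default profile is the one batch() uses; the variable names are illustrative only.

```matlab
% Look up the cluster behind the default profile.
c = parcluster();

% For the local cluster type this is expected to show parallel.cluster.Local.
disp(class(c));

% True if batch() would have to launch a fresh worker MATLAB process.
isLocal = isa(c, 'parallel.cluster.Local');
```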
Roughly speaking, the time taken to submit a batch job and wait for its results can be broken down like this:
  1. Time taken to create and submit the batch job to the scheduler
  2. Time taken to launch the worker process (unless you're using MJS)
  3. Time taken for the worker to load the job and task information
  4. Time for the worker to actually run the task
  5. Time for the worker to save the task results to disk (or database for MJS)
I suspect that the "missing" time is probably largely related to item 5 in the list above - as you've written it, the 512x512x1000 array is returned by your task function @load, and this result gets saved to disk.
How long does your save('a') command take? I suspect item 5 would take at least that long.
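One way to test this suspicion is to time save('a') on the client and then compare two batch jobs: one that returns the loaded data and one that requests zero outputs, so that the worker has nothing large to write back. This is only a sketch of the experiment. It reuses the test array from the question (about 2 GB), and the timing variable names are made up for illustration.

```matlab
% Same test data as in the question; a smaller array also works.
a = rand(512, 512, 1000);
tic; save('a', 'a'); tSave = toc;      % how long writing the array takes

% Batch job that returns the loaded data (result is saved by the worker).
b1 = batch(@load, 1, {'a'});
tic; wait(b1); tWithOutput = toc;

% Batch job that requests no outputs (no large result to save).
b0 = batch(@load, 0, {'a'});
tic; wait(b0); tNoOutput = toc;

fprintf('save on client:       %.1f s\n', tSave);
fprintf('wait, one output:     %.1f s\n', tWithOutput);
fprintf('wait, zero outputs:   %.1f s\n', tNoOutput);
```

If the zero-output job finishes much sooner, the extra time is most likely spent saving the task result rather than running the task.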
Note that there are several additional properties on the job object that can help you work out what's going on - see the reference page. In particular, note CreateTime, SubmitTime, StartTime, and FinishTime. The underlying task object has the same properties (except SubmitTime).
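As a rough illustration, the properties named above can be printed for a finished job and its task. The property names are the ones mentioned in this answer (R2014b era); other releases may expose them differently, so treat this as a sketch. The task function @rand is just a small stand-in.

```matlab
% Submit a small stand-in job and wait for it to finish.
b = batch(@rand, 1, {100});
wait(b);

% Job-level timestamps.
disp(b.CreateTime);
disp(b.SubmitTime);
disp(b.StartTime);
disp(b.FinishTime);

% Task-level timestamps (tasks have no SubmitTime).
t = b.Tasks(1);
disp(t.CreateTime);
disp(t.StartTime);
disp(t.FinishTime);
```

Comparing the task's FinishTime with the moment wait(b) returns on the client shows how much time is spent after the task function itself has completed.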
  10 Comments
Matthias on 16 Dec 2014
Edited: Matthias on 16 Dec 2014
The bugfix removes the memory leak! Thanks a lot!
However, loading in the background with parfeval still doesn't work as intended: Parfeval may not block the client Matlab instance, but it apparently does block other parallel functions. See this example:
fprintf('Start: %s\n', datestr(now, 'HH:MM:SS:FFF'));
f = parfeval(@pause, 0, 10);
fprintf('Outside parfor: %s\n', datestr(now, 'HH:MM:SS:FFF'));
parfor i = 1:10
    fprintf('Inside parfor: %s\n', datestr(now, 'HH:MM:SS:FFF'));
end
wait(f);
fprintf('End: %s\n', datestr(now, 'HH:MM:SS:FFF'));
Output:
Start: 14:50:45:204
Outside parfor: 14:50:45:219
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:297
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
Inside parfor: 14:50:55:312
End: 14:50:55:328
The timings suggest that the execution works like this: 1. Parfeval sends the job to one worker. 2. Parfor waits until all workers are available. 3. Parfor executes.
I had hoped that it would be more like this: 1. Parfeval sends the job to one worker, then execution continues in the main Matlab instance. 2. Parfor runs on whichever workers are available, while the parfeval job continues to run on its worker until done.
Is the behavior I'm observing intended? Maybe I just didn't properly understand how the parallel toolbox works. Right now, it seems frustratingly inflexible.
Edric Ellis on 17 Dec 2014
Unfortunately, as you observe, PARFOR cannot proceed while there are outstanding PARFEVAL requests (the same applies for SPMD). Your best option in this case is to recast your PARFOR loop as a series of PARFEVAL requests.
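A minimal sketch of that recasting, assuming an open pool with at least two workers so that the loop iterations can run while the background request occupies one worker. The function handle workFcn is a hypothetical stand-in for the real loop body.

```matlab
% Long-running background request (stands in for the slow load).
fBackground = parfeval(@pause, 0, 10);

% Hypothetical loop body; replace with the real per-iteration work.
workFcn = @(k) k.^2;

% Submit one parfeval request per former parfor iteration.
N = 10;
futures(1:N) = parallel.FevalFuture;
for i = 1:N
    futures(i) = parfeval(workFcn, 1, i);
end

% Collect results as they complete; idx says which request finished.
results = zeros(1, N);
for i = 1:N
    [idx, value] = fetchNext(futures);
    results(idx) = value;
end

% Finally wait for the background request.
wait(fBackground);
```

Because all requests go through the same queue, the remaining workers can pick up the loop iterations without waiting for the background request to finish.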
