We have code that runs in either serial or parallel mode. In the part under inspection, one of our MEX-files is run either on the complete data set (serial operation) or on a portion of the data determined by the current number of workers (parallel operation). In serial mode the code works fine, but in parallel mode on a 6-core computer (with 2, 4 or 6 workers) the workers crash with messages like:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining
> In distcomp.remoteparfor/handleIntervalErrorResult (line 240)
In distcomp.remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In foo3 (line 138)
In foo2 (line 142)
In foo1 (line 61)
In foo (line 166)
A little later, we sometimes get:
Error using distcomp.remoteparfor/rebuildParforController (line 194)
All workers aborted during execution of the parfor loop.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might
Warning: 4 worker(s) crashed while executing code in the current parallel pool. MATLAB will attempt to run the code again
on the remaining workers of the pool. View the crash dump files to determine what caused the workers to crash.
The crash dumps don't say a lot, but conclude with:
This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
On three other computers (two with 6-core CPUs and a notebook with a 2-core CPU) the code works fine in both serial and parallel mode. It is possible to create pools of different sizes, and the resulting output is always as expected. I have tried the code on all computers with the same number of workers (4) where possible.
This leads me to believe that the MEX-file itself is correct.
I am at a loss as to what to try next, and would appreciate any hints on how to move forward.