MATLAB Answers


Parfor loop with mex-file call crashes all workers on one computer, but runs fine on others

Asked by Bernt Nilsson on 4 Jul 2019
Latest activity: commented on by Jan on 4 Dec 2019 at 12:32
Hello!
We have code that runs in either serial or parallel mode. In the part under inspection, one of our mex-files is run either on the complete data set (serial operation) or on a part of the data sized according to the current number of workers (parallel operation). In serial mode the code works fine, but in parallel mode on a 6-core computer (with 2, 4 or 6 workers) the workers crash with messages like:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining
workers.
> In distcomp.remoteparfor/handleIntervalErrorResult (line 240)
In distcomp.remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In foo3 (line 138)
In foo2 (line 142)
In foo1 (line 61)
In foo (line 166)
A little later we sometimes get:
Error using distcomp.remoteparfor/rebuildParforController (line 194)
All workers aborted during execution of the parfor loop.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
obj.rebuildParforController();
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
...
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might
have errored.
Warning: 4 worker(s) crashed while executing code in the current parallel pool. MATLAB will attempt to run the code again
on the remaining workers of the pool. View the crash dump files to determine what caused the workers to crash.
The crash dumps don't say a lot, but conclude with:
This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
On three other computers (two with 6-core CPUs and a notebook with a 2-core CPU) the code works fine in both serial and parallel mode. Pools of different sizes can be created, and the resulting output is always as expected. I have tried the code on all computers using the same number of workers (4) where possible.
This leads me to believe the mex-file is correct.
I am at a loss concerning what to try next, and would appreciate any hints on how to move forward.

  7 Comments

Jan, that is a good suggestion. I actually tried running the code in "parallel" mode, i.e. the m-file with the parfor statement, and just changed parfor to for (like you say), leaving the code for allocating the parallel pool intact. The pool takes some time to create, but after this the code runs fine on the problem PC.
I am not sure what this tells me, however, or how to move on with the debugging. Perhaps I could try slicing the problem differently. It is not sliced exactly as in my simple example: I have 3D matrices that I slice along the third dimension, but I haven't read anywhere that this is not allowed. Perhaps this plays havoc with the mex-file? But then why not on the other PCs?
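For reference on the slicing question: MATLAB stores arrays in column-major order, so page k of an M-by-N-by-P double array is one contiguous block of M*N values. Below is a minimal sketch in plain C (no MEX API involved) of how a routine could address a single page. The names `page_ptr` and `page_sum` are hypothetical and are not taken from the code discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pointer to page k (0-based) of an m-by-n-by-p
 * array stored in column-major order. Each page is a contiguous block
 * of m*n doubles. */
const double *page_ptr(const double *data, size_t m, size_t n,
                       size_t p, size_t k)
{
    assert(k < p);              /* a page index >= p would be out of bounds */
    return data + k * m * n;
}

/* Sum one page, reading exactly m*n elements and nothing beyond them. */
double page_sum(const double *data, size_t m, size_t n,
                size_t p, size_t k)
{
    const double *page = page_ptr(data, m, n, p, k);
    double s = 0.0;
    for (size_t i = 0; i < m * n; i++)
        s += page[i];
    return s;
}
```

Slicing along the third dimension is therefore a natural fit for this layout. One thing that might be worth checking, though it is only a guess, is whether the mex-file takes its dimensions from the array it actually receives rather than from the size of the full data set, since a mismatch there would read past the end of the slice.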
I'm having this problem as well.
Could it occur because of uninitialized variables in the mex C function (I don't think I have any uninitialized pointers)?
For example, my C code looks something like
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    /* DECLARATIONS, INPUTS */
    double *Gphase   = mxGetPr(prhs[0]);
    double *dGdC     = mxGetPr(prhs[2]);
    double *StrainEM = mxGetPr(prhs[4]);
    double R         = mxGetScalar(prhs[5]);
    int32_t p, c, ci, j;

    for (j = 0; j < 1000; j++) {
        somefunc(Gphase, dGdC, StrainEM, R, p, c, ci);
    }
}
Could the variables p, c, ci and j be screwing things up?
On a 28 core CPU I quickly lose half my workers, followed by a random but gradual elimination of further workers.
I get the error
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining workers.
The worker that takes over completes the task without problems, and the functions are all fully deterministic, so the workers must be dying for some other reason.
It depends on what happens inside somefunc(). The output of mxGetPr() should be treated as a const pointer, so do you modify the contents? p, c, and ci are declared but not initialized. Do you use them correctly? Maybe "something like" conceals the actual problem. Please post the relevant part of the real code.
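To illustrate the two points above, here is a minimal sketch in plain C (no MEX API involved). The functions `scale_copy` and `pick` are hypothetical stand-ins, not the real somefunc(), whose body has not been posted. The sketch only shows the pattern: inputs are accessed through const pointers, and every local variable has a defined value before it is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a worker routine: the input is read-only
 * (const) and the result goes to a separate output buffer. */
void scale_copy(const double *in, double *out, size_t n, double r)
{
    for (size_t i = 0; i < n; i++)
        out[i] = r * in[i];     /* never writes through 'in' */
}

/* Hypothetical index helper: locals are initialized at declaration.
 * An uninitialized local has an indeterminate value; using it as an
 * index can read or write arbitrary memory. */
double pick(const double *in, size_t n)
{
    assert(n > 0);              /* needs at least one element */
    int32_t p = 0;              /* first element */
    int32_t c = (int32_t)n - 1; /* last element */
    return in[p] + in[c];
}
```

Declaring the input pointers as `const double *` lets the compiler reject accidental writes to the input data. Errors caused by indeterminate values can show up on one machine and not on another, which would be consistent with crashes that appear on only some computers, although that is not proof that this is the cause here.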


0 Answers