MATLAB Answers


Parfor loop with mex-file call crashes all workers on one computer, but runs fine on others

Asked by Bernt Nilsson on 4 Jul 2019
Latest activity Commented on by Bernt Nilsson on 11 Jul 2019
Hello!
We have code that runs in either serial or parallel mode. In the part under inspection, one of our MEX-files is run either on the complete set of data (serial operation) or on a part of the data matched to the current number of workers (parallel operation). In serial mode the code works fine, but in parallel mode on a 6-core computer (with 2, 4 or 6 workers) the workers crash with messages like:
Warning: A worker aborted during execution of the parfor loop. The parfor loop will now run again on the remaining
workers.
> In distcomp.remoteparfor/handleIntervalErrorResult (line 240)
In distcomp.remoteparfor/getCompleteIntervals (line 387)
In parallel_function>distributed_execution (line 745)
In parallel_function (line 577)
In foo3 (line 138)
In foo2 (line 142)
In foo1 (line 61)
In foo (line 166)
A little later we sometimes get:
Error using distcomp.remoteparfor/rebuildParforController (line 194)
All workers aborted during execution of the parfor loop.
Error in distcomp.remoteparfor/handleIntervalErrorResult (line 253)
obj.rebuildParforController();
Error in distcomp.remoteparfor/getCompleteIntervals (line 387)
[r, err] = obj.handleIntervalErrorResult(r);
...
The client lost connection to worker 3. This might be due to network problems, or the interactive communicating job might
have errored.
Warning: 4 worker(s) crashed while executing code in the current parallel pool. MATLAB will attempt to run the code again
on the remaining workers of the pool. View the crash dump files to determine what caused the workers to crash.
The crash dumps don't say a lot, but conclude with:
This error was detected while a MEX-file was running. If the MEX-file
is not an official MathWorks function, please examine its source code
for errors. Please consult the External Interfaces Guide for information
on debugging MEX-files.
On three other computers the code works fine in both serial and parallel mode: two computers with 6-core CPUs and a notebook with a 2-core CPU. It is possible to create pools of different sizes, and the resulting output is always as expected. I have tried the code on all computers using the same number of workers (4) where possible.
This leads me to believe the mex-file is correct.
I am at a loss concerning what to try next, and would appreciate any hints on how to move forward.

  5 Comments

Thank you both for taking the time to look at my problem.
I didn't mention this in the first post, but the problem machine is actually two identical machines, and the results are the same on both, i.e. the code will not run in parallel mode on either. The three other machines run the code fine. All machines run R2019a rev 3, but the ones where it doesn't work use a different license, still with Parallel Computing Toolbox of course.
A very simple overview of the code is:
main.m:
    ...
    if serial
        output = foo_serial(input,...);
    else
        output = foo_parallel(input,...);
    end
    ...
foo_serial.m:
    ...
    output = mex_file(input);
    ...
foo_parallel.m:
    ...
    % Slice input data
    M = N/numWorkers;
    input_slc = zeros(M,numWorkers);
    output_slc = zeros(M,numWorkers);
    % Populate input_slc
    ...
    % Run in parallel
    parfor k = 1:numWorkers
        output_slc(:,k) = mex_file(input_slc(:,k));
    end
    ...
    % Assemble output from slices
    ...
To sum up the situation: Two identical computers using Windows 10 run fine in serial mode, but when run in parallel mode all workers crash. I have tried 1, 2, 4 and 6 workers.
Three other computers, two 6-core ones running Windows 10 and openSUSE Linux and a 2-core laptop running Windows 10, run fine in both serial and parallel versions. Here I have tried 2, 4 and 6 workers, except on the two-core laptop which will only run 2 workers.
I have recompiled the MEX-file on the troubled PCs, but the results are still the same. I also ran a diagnostic for about 24 hours on one of the "failing" computers, without errors. It was not memtest86, but a diagnostic routine from HP, part of which stress-tests memory.
The MEX-file is built from 30+ C and FORTRAN routines. I didn't write any of them, so it would be difficult for me to debug them (which isn't to say I shouldn't). I said that I "believe" it is correct, but there could of course still be errors in it. And there is nothing random about the execution on the troubled PCs: the workers crash every time.
From the crash dump files I get that the problem is "Access violation", which would suggest the problem is with the mex-file. But this access violation only happens in parallel use of the same mex-file, and only on two out of five computers.
Is there a way to debug the differences between the computers with and without problems? For example, list all (or any) dynamic link libraries used, and see if there are differences?
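One possible starting point for the comparison asked about above is to record a few facts about the MATLAB process on each machine and diff the results. This is only a sketch under stated assumptions: the struct name info and its field names are made up for illustration, and inmem reports the MEX-files loaded in the current process, not the DLLs they depend on (an external tool such as Dependency Walker or dumpbin /dependents would be needed for those).

```matlab
% Sketch: collect process details that could differ between machines.
% The variable name "info" and its field names are illustrative only.
info = struct();

% MEX-files currently loaded in this MATLAB process, with full paths.
% Call mex_file once beforehand so that it shows up in the list.
[~, loadedMex] = inmem('-completenames');
info.loadedMex = loadedMex;

info.release        = version;             % MATLAB version string
info.arch           = computer('arch');    % e.g. win64 or glnxa64
info.blas           = version('-blas');    % BLAS library in use
info.lapack         = version('-lapack');  % LAPACK library in use
info.numCompThreads = maxNumCompThreads;   % computational threads in this process

% Search path for shared libraries on Windows, split into separate entries.
info.pathEntries = strsplit(getenv('PATH'), pathsep);

disp(info)
```

Running this on the client of a working and a failing machine, and saving info to a MAT-file on each, would give something concrete to compare. The same statements could also be run on a worker (inside spmd or through parfeval, both of which need Parallel Computing Toolbox), since a worker process may differ from the client, for instance in its number of computational threads.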
What happens if you run this in the serial mode:
% parfor k = 1:numWorkers
for k = 1:numWorkers
    output_slc(:,k) = mex_file(input_slc(:,k));
end
You could do some minimal debugging inside the MEX functions by inserting mexPrintf statements after the parts that import the data from the MATLAB variables and after the processing inside the MEX-file.
"the problem is "Access violation", which would suggest the problem is with the mex-file" - A MEX function can leave MATLAB's memory manager in an inconsistent state, such that a crash can also occur inside valid MATLAB code. An "Access violation" can be caused by overwriting the pointers to the data of a variable, or by a dangling pointer. Over 15 years ago, I could cause an Access violation with "pure MATLAB" code as well - well, to be correct, it was the underlying library for sprintf. I agree that it sounds likely that the MEX-file is the cause of the problem, but this is not proof.
Jan, that is a good suggestion, and I actually tried running the code in "parallel" mode, i.e. the m-file with the parfor statement, and just changed parfor to for (like you say) - leaving the code for allocating the parallel pool intact. The pool takes some time to create, but after this the code runs fine on the problem PC.
I am not sure what this tells me, however, or rather how to move on with the debugging. Perhaps I could try slicing the problem differently. It is not sliced exactly as in my simple example. I have 3D matrices which I slice along the third dimension, but I haven't read anywhere that this is not allowed. Perhaps this plays havoc with the MEX-file? Hm, but then why not on the other PCs?
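As a sketch of what "slicing differently" could look like for a 3D array, the blocks can be pre-split along the third dimension into a cell array, so that each iteration receives its own independent array. The names process_block, data, inBlocks and outBlocks are hypothetical, and process_block is only a stand-in for the real mex_file call, whose actual input and output shapes are not shown in this thread.

```matlab
% Stand-in for the real MEX call (hypothetical); replace with mex_file.
process_block = @(block) 2 * block;

data          = reshape(1:24, 2, 3, 4);   % example 3-D input with 4 pages
numBlocks     = 2;                        % plays the role of numWorkers
pagesPerBlock = size(data, 3) / numBlocks; % assumes an even split

% Pre-split into independent arrays along the third dimension.
inBlocks = cell(1, numBlocks);
for k = 1:numBlocks
    pages = (k-1)*pagesPerBlock + (1:pagesPerBlock);
    inBlocks{k} = data(:, :, pages);
end

% Process each block; once the serial run is verified, this "for"
% can be changed to "parfor" (inBlocks and outBlocks are then sliced).
outBlocks = cell(1, numBlocks);
for k = 1:numBlocks
    outBlocks{k} = process_block(inBlocks{k});
end

% Reassemble the output along the third dimension.
result = cat(3, outBlocks{:});
```

If the crash persists with parfor even when each block is a separate array like this, the slicing scheme itself becomes a less likely suspect.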


0 Answers