Inconsistent lost connection with worker error

rp on 7 Feb 2018
Answered: Manhui Wang on 8 Feb 2019
I have a program which runs an spmd code block. At the end of the block, I have each worker save their workspace to file. Sometimes I get the following error:
The client lost connection to worker #. This might be due to network problems, or the interactive communicating job might have errored.
Based on the printed output from my code, I know that the error most likely occurs near the workspace-saving step, after the rest of the program has executed.
This error does not always happen, however. I find it generally happens more often when the workers are trying to save larger files, but not always. I can run the same code twice and once it will error and once it will not. I am running the code on a server, so I'm not sure if the memory demands on the server might be contributing (if it's a memory issue). Any thoughts?
EDIT:
Because the workers exchange messages frequently in the spmd block, the files are likely being written simultaneously. I wonder whether, with these larger files, there is a higher probability of writing to the same disk space and creating corrupt files (often the .mat files exist but cannot be read). Perhaps forcing the program to save sequentially will help?
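One way to test the sequential-save idea is to let the workers take turns writing, with a barrier between turns. The sketch below is only an illustration under assumptions: the output directory, the worker_NN.mat file names, the saveWorkerData helper, and the random stand-in data are all hypothetical, and it assumes every worker can write to the same directory. It uses labindex, numlabs, and labBarrier, which are the spmd functions available in R2017b. Whether serializing the writes actually avoids the lost-connection error is not established in this thread.

```matlab
% Sketch only: workers save one at a time instead of all at once.
% Assumption: outDir is a directory that every worker can write to.
outDir = tempname;
mkdir(outDir);

spmd
    % Stand-in for the data each worker would really produce.
    localResult = rand(100) * labindex;

    for k = 1:numlabs
        if labindex == k
            % Only worker k writes during turn k.
            fname = fullfile(outDir, sprintf('worker_%02d.mat', labindex));
            saveWorkerData(fname, localResult);
        end
        % All workers wait here until worker k has finished writing.
        labBarrier;
    end
end

% Hypothetical helper: save is called from a function because calling it
% directly in the spmd body violates workspace transparency.
function saveWorkerData(fname, data)
    save(fname, 'data');
end
```

If the error disappears with the turn-taking loop, that would point toward contention during simultaneous writes. If it persists, memory pressure on the server remains a candidate.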
EDIT:
I also get the following message when it fails to write the files:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
  3 Comments
rp on 8 Feb 2018
I believe it's just the Parallel Computing Toolbox. MATLAB version 9.3.0.713579 (R2017b). CentOS Linux release 7.4.1708 (Core).
rp on 8 Feb 2018
Also, I have been calling the save function in the spmd block through another function. I wonder whether that is contributing to the problem, since saving within spmd or parfor is apparently discouraged:
https://www.mathworks.com/matlabcentral/answers/215594-saving-within-spmd-or-parfor


Answers (2)

Jiannan Zhou on 25 Aug 2018
I encountered exactly the same problem on R2017b, using the Parallel Computing Toolbox and the save function.

Manhui Wang on 8 Feb 2019
I see a similar problem with R2017b:
message with properties:
Identifier: 'MATLAB:connector:connector:ConnectorNotRunning'
Arguments: {}
but it appears to work fine with R2018a.
