Restart a parpool worker
When I run parfor, sometimes a worker terminates with an error and the simulation continues with the remaining workers. Is there a way to automatically restart the parpool worker without having to stop and relaunch the simulation? I am at my wits' end as to how to achieve this.
Edric Ellis on 16 Jun 2020
Unfortunately, there's no simple way to do this when using parfor with parpool. I can think of a couple of workarounds that might help, depending very much on how your problem is set up.
Firstly, you could try the "cluster parfor" approach where you don't launch a parpool at all, and instead let the cluster run the loop directly. This is described in the doc here: https://www.mathworks.com/help/parallel-computing/parforoptions.html (See the section "Run parfor on a Cluster Without a Parallel Pool"). This approach launches independent tasks on your cluster rather than a parallel pool. This will only get decent performance if the time taken to launch the workers for the independent tasks is not significant compared to the time taken to run the entire loop. If it works for you, this is highly likely to be the simplest approach.
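As a rough sketch of that first approach, assuming your default cluster profile is the one you want to use: workFcn is a hypothetical stand-in for one iteration of your simulation, and N is an arbitrary iteration count.

```matlab
% Sketch only: run a parfor loop directly on a cluster, with no parallel pool.
% workFcn and N are placeholders (assumptions), not part of the original problem.
workFcn = @(k) k^2;              % hypothetical per-iteration work
N = 20;                          % hypothetical number of iterations

cluster = parcluster();          % default cluster profile (assumption)
opts = parforOptions(cluster);   % ask parfor to submit independent tasks to the cluster

results = zeros(1, N);
parfor (k = 1:N, opts)
    results(k) = workFcn(k);     % each subrange of k runs as an independent task
end
```

Because no pool is involved here, a failing worker process only affects the task it was running, but you pay the worker start-up cost for every task.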
Secondly, if you can restructure your code to use parfeval instead of parfor, you could check the NumWorkers property of the parallel pool while consuming results, and if it decreases, restart the pool. This would be a bunch more work because you'd need to keep track of the incomplete work, and you'd have to re-submit it.
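Something along these lines might work for the parfeval approach. This is only a sketch under assumptions: workFcn, N and maxAttempts are hypothetical names, each piece of work is assumed to return one scalar, and the retry cap is there so that a deterministic error in workFcn does not loop forever.

```matlab
% Sketch only: parfeval with resubmission of incomplete work after a worker is lost.
workFcn = @(k) k^2;      % hypothetical per-item work
N = 20;                  % hypothetical number of work items
maxAttempts = 5;         % hypothetical cap on pool restarts

results = nan(1, N);
pending = 1:N;           % indices that still have no result
attempt = 0;

while ~isempty(pending) && attempt < maxAttempts
    attempt = attempt + 1;
    pool = gcp('nocreate');
    if isempty(pool)
        pool = parpool();                    % default profile (assumption)
    end
    startWorkers = pool.NumWorkers;

    submitted = pending;                     % what this round is responsible for
    clear futures
    futures(1:numel(submitted)) = parallel.FevalFuture;
    for j = 1:numel(submitted)
        futures(j) = parfeval(pool, workFcn, 1, submitted(j));
    end

    try
        for j = 1:numel(submitted)
            [idx, value] = fetchNext(futures);        % next finished future
            results(submitted(idx)) = value;
            pending(pending == submitted(idx)) = [];  % mark as complete
            if pool.NumWorkers < startWorkers
                error('sketch:workerLost', 'The pool lost a worker.');
            end
        end
    catch err
        warning('Restarting the pool: %s', err.message);
        delete(gcp('nocreate'));             % drop the damaged pool; the loop resubmits
    end
end
```

After the loop, anything still listed in pending never produced a result within maxAttempts rounds.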
A third approach might be to restructure your parfor loop to send its results back to the client using a parallel.pool.DataQueue. If you launch the parpool using the 'SpmdEnabled', true option, the pool will automatically shut down any time a worker crashes. The idea is that the client stores the partial results of your loop as they arrive on the DataQueue. The parfor loop terminates with an error when a worker crashes, but you still have the partial results, so you can start a new pool and run a parfor loop over just the incomplete iterations.
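Here is a minimal sketch of that third idea, to be saved as runWithRestart.m. The names runWithRestart, recordResult and maxAttempts are hypothetical, each iteration is assumed to return one scalar, and a containers.Map is used only because it is a handle object that the DataQueue callback can update in place.

```matlab
function results = runWithRestart(workFcn, N)
% Sketch only: keep partial parfor results on the client via a DataQueue,
% then rerun only the missing iterations on a fresh pool.
store = containers.Map('KeyType', 'double', 'ValueType', 'double');
maxAttempts = 5;                          % hypothetical cap on pool restarts

for attempt = 1:maxAttempts
    todo = find(~isKey(store, num2cell(1:N)));   % iterations with no result yet
    if isempty(todo)
        break
    end
    delete(gcp('nocreate'));              % discard any old or damaged pool
    parpool('SpmdEnabled', true);         % pool shuts down if a worker crashes
    q = parallel.pool.DataQueue;
    afterEach(q, @(msg) recordResult(store, msg));
    try
        parfor j = 1:numel(todo)
            k = todo(j);
            send(q, [k, workFcn(k)]);     % ship [index, value] back to the client
        end
    catch err
        warning('parfor stopped early (attempt %d): %s', attempt, err.message);
    end
end

results = nan(1, N);                      % NaN marks iterations that never finished
done = cell2mat(keys(store));
results(done) = cell2mat(values(store));
end

function recordResult(store, msg)
% Runs on the client; store is a handle, so the caller sees the update.
store(msg(1)) = msg(2);
end
```

You would call it as, for example, results = runWithRestart(@(k) k^2, 20). Since each round recomputes todo from what actually arrived on the client, any result lost along with a crashed worker is simply recomputed on the next round.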