- High level: Let's say you're processing 10k files. If MATLAB were paused and restarted, it would need to know which file to continue with. There's no chance it could continue with the file it was in the middle of processing (see low level), but you could add a "semaphore" to your code that leaves behind a record of the last file you processed. Of course, the parfor doesn't necessarily process the file list in order, so knowing that you've processed file #234 doesn't mean 1-233 are done. When you run your code, you need to treat it as if it could be the 1st time you're running it (i.e., not from a hibernation) or the Nth time (i.e., from a hibernation). You need to get the list of files to process, weed out the ones you've already processed, and then start your parfor with the remaining files. Other than coming up with a semaphore scheme, this isn't as difficult as it sounds. At the end of the day, the parfor is just processing whatever files you give it. You just need to figure out the correct list.
- Low level: This is the larger (impossible?) issue. For any given file from above, you have calculations in memory. You might be calculating the FFT of a matrix. Or maybe you have distributed arrays or gpuArrays loaded in memory. This is really where you'd need MATLAB to have its own checkpointing, and it's the reason why, when you come back from hibernation, you'd need to reprocess the files you were working on at the time.
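The high-level scheme above can be sketched roughly as follows. This is only a minimal sketch under assumptions, not a built-in MATLAB feature: the function name `resumeBatch`, the `.done` marker-file convention, the separate marker folder, and the placeholder `processFile` are all hypothetical names made up for illustration. It leaves one marker per finished file rather than a single "last file" record, since the parfor doesn't process the list in order.

```matlab
function todo = resumeBatch(inDir, doneDir, pattern)
% resumeBatch  Process only the files that have no marker from an earlier run.
%   inDir   - folder holding the input files (hypothetical layout)
%   doneDir - folder holding one "<name>.done" marker per finished file
%   pattern - wildcard for the inputs, e.g. '*.dat'
% Returns the list of file names that this call had to process.

if ~exist(doneDir, 'dir')
    mkdir(doneDir);
end

% 1. Get the full list of files to process.
listing  = dir(fullfile(inDir, pattern));
allNames = {listing.name};

% 2. Weed out the ones that already have a marker.
markers   = dir(fullfile(doneDir, '*.done'));
doneNames = regexprep({markers.name}, '\.done$', '');
todo      = setdiff(allNames, doneNames);

% 3. Start the parfor with the remaining files. Order doesn't matter,
%    because every file gets its own marker.
parfor k = 1:numel(todo)
    processFile(fullfile(inDir, todo{k}));
    % Write the marker only after the work is finished, so a file that
    % was interrupted midway gets reprocessed on the next run.
    fid = fopen(fullfile(doneDir, [todo{k} '.done']), 'w');
    fclose(fid);
end
end

function processFile(inFile)
% Hypothetical placeholder for the real per-file work.
fprintf('Processing %s\n', inFile);
end
```

Calling `resumeBatch(inDir, doneDir, '*.dat')` the 1st time processes everything; calling it again after an interruption processes only the files without a marker. Whatever a worker had in memory for an unfinished file is still lost, which is the low-level issue.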
Parallel Computing Toolbox, Pausing, and Hibernation
Suppose you are running code with the Parallel Computing Toolbox on a laptop and, for some reason, you must leave and take the laptop with you. Unfortunately, the code takes a significant amount of time even with the toolbox. Is it possible in this case to pause the execution and hibernate the system without terminating the execution? Assume a Windows 11 operating system.
The reason I'm asking is that I noticed recently that if I paused the code for a while (not sure how many minutes), it would not resume, and I had to rerun it from scratch. Any help in this matter is much appreciated.
Raymond Norris on 20 Apr 2022
What you're asking for would require "checkpointing", where an application has the ability to stop midstream and resume later. Checkpointing lets the application know where it halted and what is required to continue. MATLAB doesn't have checkpointing built in; that is, it never tracks how much is done, what's left to be done, or, most importantly, how to restart.
Let's think about a couple of scenarios, at a high level and at a low level.