error using matlabpool open local in a cluster

Hi,
I am using matlabpool with the local profile to run parallel calculations on a cluster that uses Torque as its scheduler. The MATLAB version is R2012b. Each job is submitted to the cluster with "qsub" and calls "matlabpool open local 12". Sometimes it works, and sometimes it fails with one of these errors:
Error using distcomp.interactiveclient/start (line 90)
Failed to start matlabpool.
Error using parallel.Job/submit (line 290)
A communicating job must have a single task defined before submission.
or
Error using distcomp.interactiveclient/start (line 90)
Failed to start matlabpool.
Error using parallel.Job/submit (line 290)
Operands to the || and && operators must be convertible to logical
scalar values.
cat: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0247/mom_priv/epilogue: line 69: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0247/mom_priv/epilogue: line 71: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0247/mom_priv/epilogue: line 72: /tmp/diagnose/start_time: No such file or directory
or
Error using
distcomp.interactiveclient/pGetSockets>iThrowIfBadParallelJobStatus (line
90)
The interactive communicating job is unavailable. It is likely that the
communicating job has been deleted.
or
Error using distcomp.interactiveclient/start (line 90)
Failed to start matlabpool.
Error using parallel.Cluster/createCommunicatingJob (line 82)
Unable to write to MAT-file
/home/jhya1786/.matlab/local_cluster_jobs/R2012b/Job1.in.mat
File may be corrupt.
cat: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0416/mom_priv/epilogue: line 69: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0416/mom_priv/epilogue: line 71: /tmp/diagnose/start_time: No such file or directory
/curc/torque/nodes/node0416/mom_priv/epilogue: line 72: /tmp/diagnose/start_time: No such file or directory
Thanks for answering!

Answers (2)

Jason Ross on 7 Dec 2012
Edited: Jason Ross on 7 Dec 2012
If I'm understanding things correctly, you submit a job script to a Torque cluster using "qsub (plus some list of arguments)", and one of the lines in that script is "matlabpool open local 12". When the job runs, the scheduler executes the script on the nodes in the cluster, under your username. The MATLAB running on each node has Parallel Computing Toolbox installed, since the "matlabpool open local" command can succeed, but it doesn't sound as if the Torque cluster has been set up to use MATLAB, or if it has, you aren't using the integration for some reason.
I can see why this is falling over.
  • The local scheduler depends on files in /home/jhya1786/.matlab/local_cluster_jobs/R2012b. As you can see from the error output, files called Job1, Job2, etc. are created there as a job progresses. Since you have only one home directory, each node on which your command is launched tries to start its own local cluster and write to that same location, with every node trying to create Job1, Job2, ..., JobN at the same time. The processes know nothing about each other, so they collide and step on each other as intermediate files are created and removed.
  • I would expect that the reason it sometimes works is that in some cases your job gets scheduled on one node at a sufficiently different start time than another, so the files don't collide.
  • When you run "matlabpool open local 12", you spawn 12 additional MATLAB processes on each node where the job is executing. Depending on how work is scheduled on your cluster and on the number of processors in each node, this may actually be less efficient and slow you down. For example, if a node has 8 processors, 12 workers over-subscribe it. It's also possible that you are using up all the RAM on a node and forcing it into swap, which will also kill performance.
  • There is an integration with Torque that allows for better scheduling of work on a Torque cluster. You would need to check with your cluster admins to see if it's been set up. If it has been set up, they should be able to provide you with a cluster profile that will let you submit your code to the cluster directly from MATLAB, which uses qsub under the hood to submit the work. This integration avoids the collisions in your home directory that you are seeing.
  • The good news is that if you have code that already works with the local scheduler, it should also work once you have the profile for the Torque cluster, with no changes or only minimal ones (for example, telling the cluster where to find files, or attaching the files to the job).
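
As a rough illustration of the first and third points above (jobs sharing one storage directory, and more workers than processors), here is a minimal shell sketch of what a job script might do before starting MATLAB. It is not part of the original answer. The function names `job_storage_dir` and `capped_workers` are hypothetical, `PBS_JOBID` is the job identifier that Torque exports into the job environment, and the fallbacks to `/tmp` and the shell PID are assumptions for running outside the scheduler.

```shell
#!/bin/sh
# Sketch only: build a storage directory name that is unique per Torque job,
# so concurrent jobs do not all write Job1, Job2, ... into the same place.

# Print a per-job directory path. PBS_JOBID is set by Torque inside a job;
# the "nojob.<pid>" fallback is an assumption for runs outside the scheduler.
job_storage_dir() {
    base="${TMPDIR:-/tmp}"
    id="${PBS_JOBID:-nojob.$$}"
    printf '%s/matlab_jobs_%s\n' "$base" "$id"
}

# Print the requested worker count, capped at the number of cores on the node,
# to avoid starting more MATLAB workers than there are processors.
capped_workers() {
    requested="$1"
    cores="$2"
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

JOB_DIR="$(job_storage_dir)"
mkdir -p "$JOB_DIR"
echo "per-job storage directory: $JOB_DIR"
```

The MATLAB side would still have to be told to use this directory, for instance by setting the `JobStorageLocation` property of the object returned by `parcluster('local')` before opening the pool. Whether that fully avoids the collisions on a given cluster is not something this sketch establishes.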

Thomas Ibbotson on 4 Dec 2012
Hi Jhih-An
If I understand correctly, you are submitting jobs to a cluster, and each job attempts to open a matlabpool on the cluster using the local scheduler. Instead, I would recommend using a communicating job of type 'pool' to run your code on the cluster and removing the 'matlabpool open local 12' line from the job you are trying to run.
For more information about running code using communicating jobs of type 'pool' see the createCommunicatingJob documentation or the documentation for a batch parallel loop.
Thanks, Tom
  2 Comments
Jhih-An Yang on 7 Dec 2012
Edited: Jhih-An Yang on 7 Dec 2012
Unfortunately, the same problem happens even when I use a communicating job.
