MATLAB Parallel Computing on Cluster - File not found (Task8-32.in.mat)
Hi,
I am trying to use MATLAB on a cluster with multiple nodes.
For now, I'm trying with 2 nodes of 16 cores each.
I have generated a new Generic Cluster profile using the plugin scripts for Sun Grid Engine (SGE).
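For reference, the profile was set up roughly like this (the paths and profile name below are placeholders, not my real values):
c = parallel.cluster.Generic;                             % generic cluster object
c.JobStorageLocation = '/shared/matlab_jobs';             % shared folder visible to all nodes
c.NumWorkers = 32;                                        % 2 nodes x 16 cores
c.OperatingSystem = 'unix';
c.HasSharedFilesystem = true;
c.PluginScriptsLocation = '/path/to/sge/plugin/scripts';  % SGE plugin scripts
saveAsProfile(c, 'sgeCluster');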
The independent job validation works fine, while the spmd, pool, and parpool tests fail (only if I use more than 1 node!).
Looking at the job logs, I saw that the problem was related to mw_mpiexec (MPI was crashing).
I tried a different MPI (MPICH 4.1.1) and now MPI isn't crashing anymore; however, the MATLAB instances on the different nodes are unable to find the files automatically generated by the validation cases.
I have attached the log file of the validation.
Could you please help me solve this issue?
Thank you,
Antonio
Answers (1)
Raymond Norris
on 16 May 2023
Hi @Antonio Cioffi. I'm not sure why mpiexec is crashing, but I can tell you why you're getting validation issues. When you switch MPI libraries, you need to point MATLAB to the correct libmpi.so. When you say you've tried a different MPI, how did you go about it? You'll need to create your own mpiLibConf.m file to point to your libmpi.so (see the documentation for more info).
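For reference, a minimal mpiLibConf.m looks something like the following (the install prefix and library names are assumptions; adjust them to your MPICH 4.1.1 install, and make sure the file is on the workers' path ahead of MATLAB's default one, for example by placing it in the job's working folder or adding its folder via AdditionalPaths):
function [lib, extras] = mpiLibConf
% Tell MATLAB workers which MPI library to load instead of the built-in one.
mpiRoot = '/opt/mpich-4.1.1';                    % assumed install prefix -- change to yours
lib = fullfile(mpiRoot, 'lib', 'libmpich.so');   % primary MPI library
extras = {};                                     % helper libraries the build needs, if any
                                                 % (older MPICH builds also required libmpl.so / libopa.so)
end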
I can tell MATLAB is not loading the correct library because of the following:
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task1"
The [number] is the MPI rank. This is telling you that each worker is creating a file in the folder Job25 with the filename Task1. And they're all "task 1" because they haven't properly started -- they're not aware there are other MPI ranks. What it should show is
[28] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task28"
[31] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task31"
[30] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task30"
[29] 2023-05-08 21:42:58 | About to find job and task using locations "Job25" and "Job25/Task29"
That is what would indicate that each worker has started up correctly. Since your log shows every rank looking for Task1 instead, MATLAB must not be finding the correct libmpi.so.
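Once MATLAB is loading the right library, a quick sanity check is to open a multi-node pool and have each worker report its index; every worker should print a distinct number (the profile name here is just a placeholder):
c = parcluster('sgeCluster');        % placeholder profile name
p = parpool(c, 32);                  % 2 nodes x 16 cores
spmd
    [~, host] = system('hostname');
    fprintf('Worker %d of %d on %s', spmdIndex, spmdSize, host);  % use labindex/numlabs before R2022b
end
delete(p)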
I would suggest you contact support@mathworks.com; they can help you figure out why SGE can't run multi-node (you do have passwordless SSH between the compute nodes, right?).
2 Comments
Raymond Norris
on 18 May 2023
I want to clarify what's happening, though. Notice the following
./mpiexec -info | grep device
will display
--with-device=ch3:nemesis
Therefore, nemesis uses shared memory for intranode communication and TCP for internode communication (the default for nemesis, per https://www.mpich.org/static/downloads/3.2.1/mpich-3.2.1-README.txt). If it had instead been built with one of
--with-device=ch3:nemesis:mxm (Mellanox InfiniBand)
--with-device=ch3:nemesis:ofi
--with-device=ch4:ucx
then traffic would go natively over InfiniBand. Instead, I believe what you are getting is IPoIB (IP over InfiniBand).