Problem with parallel configuration. Parallel job test validation failed!!
We want to set up a cluster of two PCs (Intel Core i5, 4 cores per machine). We are using MATLAB R2009b and the Admin Center to create a job manager with 4 workers, one core per worker (2 workers per machine). The mdce service is installed on both machines with the default mdce_def file. This part works fine.
The problems appear when we try to validate a parallel configuration that uses this job manager with a minimum and maximum of 4 workers: the parallel job test fails.
The validation produces several error lines in mdce-service.log in the log folder:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:0: localhost: 1: Fatal error in MPI_Comm_connect: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPI_Comm_connect(119).....................: MPI_Comm_connect(port="tag=0 port=28351 description=lp-apd12 ifname=172.22.4.92 ", MPI_INFO_NULL, root=0, comm=0x84000000, newcomm=0000000001023A60) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPID_Comm_connect(187)....................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_Comm_connect(405)...................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Sendrecv(126)........................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIC_Wait(270)............................:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDI_CH3I_Progress_handle_sock_event(420):
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:job aborted using terminate/kill:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:out:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:process: node: exit code: error message:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:0: localhost: 1: Fatal error in MPI_Intercomm_merge: Other MPI error, error stack:
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(284): MPI_Intercomm_merge(comm=0xc4000005, high=1, newintracomm=0000000001023A68) failed
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:out:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-15:err:MPIDU_Sock_wait(2603).....................: The specified network name is no longer available. (errno 64)
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-16:err:MPI_Intercomm_merge(262): Too many communicators
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-17:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "cp".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "nodisplay".
INFO | jvm 1 | 2011/08/18 16:09:04 | Thu Aug 18 16:09:04 CEST 2011:Group-18:out:Warning: Unrecognized MATLAB option "Djava.security.policy=C:\Program Files\MATLAB\R2009b\toolbox\distcomp\config\jsk-all.policy".
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-18:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:Warning: Unable to locate a personal folder for $documents\MATLAB
INFO | jvm 1 | 2011/08/18 16:09:05 | Thu Aug 18 16:09:05 CEST 2011:Group-17:out:{Warning: Userpath must be an absolute path and must exist on disk.}
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-17:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: To get started, type one of these: helpwin, helpdesk, or demo.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out: For product information, visit www.mathworks.com.
INFO | jvm 1 | 2011/08/18 16:09:06 | Thu Aug 18 16:09:06 CEST 2011:Group-18:out:
INFO | jvm 1 | 2011/08/18 16:09:07 | Thu Aug 18 16:09:07 CEST 2011:Group-17:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker02
INFO | jvm 1 | 2011/08/18 16:09:08 | Thu Aug 18 16:09:07 CEST 2011:Group-18:out:» Thu Aug 18 16:09:07 CEST 2011 Worker started: pc-goba_worker01
Thanks
Accepted Answer
More Answers (3)
Jason Ross
on 18 Aug 2011
0 votes
In Admin Center, if you run the connectivity test (Hosts > Test Connectivity) are there any errors or warnings?
1 Comment
Thomas O'Donnell
on 29 May 2013
Where is the Admin Center? I would like to view the health of my MATLAB R2012a parallel server.
Gonzalo Blanco
on 18 Aug 2011
0 votes
3 Comments
Jason Ross
on 18 Aug 2011
The easiest way to do this is to turn off any firewalls that are up. If you can't take down the firewall, you will need to open communications ports in them to allow the nodes to communicate.
You can find details of the ports here:
http://www.mathworks.com/products/distriben/requirements.html
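If the firewall has to stay up, it can help to confirm from each machine that the required ports on the other machine actually accept connections. The following is only a minimal sketch in Python (not part of MATLAB or the mdce tooling); the function names are made up for illustration, and the host names and port numbers you pass in are your own assumptions, taken from your cluster and the requirements page for your release.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ports(hosts, ports, timeout=3.0):
    """Map every (host, port) pair to True (reachable) or False (blocked or closed)."""
    return {(h, p): port_is_open(h, p, timeout) for h in hosts for p in ports}
```

For example, `check_ports(["node1", "node2"], [27350])` (placeholder host names and a placeholder port) returns a dictionary you can print on each machine. A `False` entry only says that nothing answered on that port; it does not tell you whether a firewall or a stopped service is the cause.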
Gonzalo Blanco
on 19 Aug 2011
Jason Ross
on 19 Aug 2011
Is the firewall off on all of the machines in the cluster?
Do the errors/warnings persist in the Admin Center?
Are there other things running which might also be blocking communication? Virus scanners, malware scanners, etc -- they might block this kind of thing as "suspicious activity"
Jason Ross
on 19 Aug 2011
0 votes
Other things you might want to look for:
From the "The specified network name is no longer available. (errno 64)" error message -- check that every host has correct forward and reverse DNS lookups in place, and that your DNS is reliable. Check the error logs on the host to see if something is going on here.
Check your system PATH to see if there are other MATLAB installs on the path. The error stack that starts with "Unrecognized MATLAB option "cp"." and then continues on with "nodisplay", "Djava.security.policy" and so on makes it look like something is starting MATLAB in a way that's not expected. If you haven't set ClusterMatlabRoot to the installation of MATLAB and are using "on path", you might want to try setting it to the MATLAB installation you want to use.
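The forward and reverse DNS check mentioned above can be scripted so that it is easy to repeat on every host. This is a rough sketch using only the Python standard library; `dns_round_trip` is a hypothetical helper name, and comparing short host names case-insensitively is an assumption that may be too loose or too strict for your network.

```python
import socket


def dns_round_trip(hostname):
    """Resolve hostname to an IPv4 address, then resolve that address back.

    Returns a dict with the forward address, the reverse name, and a flag
    saying whether the reverse lookup leads back to the same short host name.
    """
    result = {"host": hostname, "address": None, "reverse": None, "consistent": False}
    try:
        result["address"] = socket.gethostbyname(hostname)  # forward lookup
    except OSError:
        return result
    try:
        name, aliases, _ = socket.gethostbyaddr(result["address"])  # reverse lookup
    except OSError:
        return result
    result["reverse"] = name

    def short(n):
        # Compare only the first label, ignoring case (an assumption).
        return n.split(".")[0].lower()

    result["consistent"] = short(hostname) in {short(n) for n in [name, *aliases]}
    return result
```

Run it on each machine for every host name in the cluster, for example `dns_round_trip("lp-apd12")` using the name that appears in the log above. An entry with `address` or `reverse` left as `None`, or with `consistent` set to `False`, is worth a closer look.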
7 Comments
Gonzalo Blanco
on 22 Aug 2011
Jason Ross
on 22 Aug 2011
You need to make sure that the version of MATLAB you are using on the cluster matches with the version of MATLAB on the client exactly. Mixing and matching versions is most definitely not supported, tested or known to work at all. I really highly recommend cleaning up the PATH so that you aren't inadvertently picking up differing versions of MATLAB (or other programs/utilities that may be present -- system PATHs with eclipsed programs have caused more issues than I can count in my career). Setting ClusterMatlabRoot in your configuration will select that MATLAB as the one to be used.
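To see whether more than one MATLAB is reachable through the PATH, you can list every matching executable in search order. This is an illustrative sketch only; `find_on_path` is a hypothetical helper, and the executable names you pass (for example `matlab.exe` or `matlab.bat`) are assumptions about how your installations are laid out.

```python
import os


def find_on_path(exe_names, path=None):
    """Return every PATH entry that contains one of exe_names, in search order."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        directory = directory.strip('"')  # Windows PATH entries are sometimes quoted
        if not directory:
            continue
        for name in exe_names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits
```

Calling `find_on_path(["matlab.exe", "matlab.bat"])` on each machine shows the candidates in the order the system would search them. The first hit is the one an "on path" launch would pick up; any later hits are eclipsed installs that are good candidates for cleanup.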
Gonzalo Blanco
on 23 Aug 2011
Gonzalo Blanco
on 23 Aug 2011
Jason Ross
on 23 Aug 2011
Have you re-run the tests in the admin center? Are they free of errors and warnings?
Gonzalo Blanco
on 23 Aug 2011
Alexandre Malotchko
on 6 Apr 2016
The message "does not respond to java.net.InetAddress.isReachable()" on R2015b means that all machines involved need to have the ECHO service available on port 7. On Windows 7, for instance, you need to install the Microsoft Simple TCP/IP Services feature and configure the firewalls to allow port 7 traffic.
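A quick way to check whether a machine answers on the echo port is to send a few bytes and see whether they come back. The sketch below is a hypothetical helper written in Python, not something the Admin Center runs; it is stricter than a plain reachability probe because it requires a full echo round trip, so treat a `False` result as a hint rather than proof.

```python
import socket


def echo_responds(host, port=7, payload=b"ping", timeout=3.0):
    """Return True if host:port echoes payload back over TCP within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            received = b""
            # Read until the whole payload is back or the peer closes the connection.
            while len(received) < len(payload):
                chunk = sock.recv(len(payload) - len(received))
                if not chunk:
                    break
                received += chunk
            return received == payload
    except OSError:
        # Refused, timed out, or the host name could not be resolved.
        return False
```

For example, `echo_responds("node1")` (a placeholder host name) should return `True` once the echo service is installed and port 7 is allowed through the firewall.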