Error when using parallel computing toolbox

Florian on 12 Jul 2023
Commented: Farhad on 6 Oct 2023
Hi,
I am running MATLAB on a Linux cluster, using the Parallel Computing Toolbox. Everything worked until now, but suddenly, when I run my code, I get the following error:
Parallel pool failed to start with the following error. For more detailed information, validate the profile 'Processes' in the Cluster Profile Manager.
Error in samplescript_expdata (line 278)
parpool(8)
Caused by:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
So I opened the Cluster Profile Manager and ran the validation. The last stage fails as well, with the lengthy error message pasted below.
Does anyone have any idea what's wrong and how I can fix this?
Thank you in advance for your help!
Stage started at 2:20:01 PM. Completed in 0 min 30 sec.
Error Report: An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Debug Log: CLIENT LOG OUTPUT
Session starting on cluster type: Local, with name: Processes
Session failed to start when creating InteractiveClient. Error: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);
Session failed to start with message: Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Error in parallel.internal.pool.CppBackedSession.buildInteractiveClient (line 397)
clientSession = parallel.internal.pool.SpfClientSession(bindEndpoint, ...
Error in parallel.internal.pool.AbstractClusterPool>@(c)parallel.internal.pool.CppBackedSession.buildInteractiveClient(c,sessionInfo) (line 839)
sessionBuildFcn = @(c) parallel.internal.pool.CppBackedSession.buildInteractiveClient(c, sessionInfo);
Error in parallel.internal.pool.AbstractInteractiveClient/start (line 100)
[session, connFcn] = sessionBuildFcn(clus);
Error in parallel.internal.pool.AbstractClusterPool>iStartClient (line 874)
spmdInitialized = client.start(sessionBuildFcn, sessionInfo, numWorkers, cluster, ...
Error in parallel.internal.pool.AbstractClusterPool.hBuildPool (line 636)
iStartClient(client, sessionInfo, forceSpmdEnabled, cluster, supportRestart, argsList);
Error in parallel.internal.types.ValidationStages>iOpenPoolForCluster (line 510)
aPool = parallel.internal.pool.AbstractClusterPool.hBuildPool('Cluster', cluster, ...
Error in parallel.internal.types.ValidationStages>@()iOpenPoolForCluster(runInfo)
Error in parallel.internal.types.ValidationStages>iCallWithNoHotlinks (line 391)
[varargout{1:nargout}] = fcn();
Error in parallel.internal.types.ValidationStages>iRunParpoolStage (line 302)
[commandWindowOutput, aPool] = evalc(iWrapForEvalc(openPoolFcn));
Error in parallel.internal.types.ValidationStages/run (line 74)
[eventData, runInfo] = obj.RunFunction(obj, runInfo);
Error in parallel.internal.validator.Validator/runValidationSuite (line 191)
[eventData, stageRunInfo] = currentStage.run(stageRunInfo);
Error in parallel.internal.validator.Validator/validate (line 103)
status = obj.runValidationSuite(profileName, suite);
Error in parallel.internal.ui.AbstractValidationManager/validate (line 36)
obj.Validator.validate(profileName, validationSuite);
Error in parallel.internal.ui.ValidationManager.validateProfile (line 36)
parallel.internal.ui.ValidationManager.getOrCreateInstance().validate(profileName, suite);.
Failed to run the DisarmableOncleanup callback due to the following error:
Unrecognized method, property, or field 'pStopLabsAndDisconnect' for class 'parallel.internal.pool.InteractivePoolClient'.
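To collect the same failure outside the Cluster Profile Manager, a minimal sketch along the following lines may help. It assumes the default 'Processes' profile named in the error message and the pool size of 8 from the script; it only reports the cluster settings and the full error chain and does not fix anything by itself.

```matlab
% Close any pool left over from an earlier attempt.
delete(gcp('nocreate'));

% 'Processes' is the profile named in the error message.
c = parcluster('Processes');
fprintf('Host: %s, NumWorkers: %d\n', c.Host, c.NumWorkers);
fprintf('JobStorageLocation: %s\n', c.JobStorageLocation);

try
    % Same pool size as in the failing script (parpool(8)).
    pool = parpool(c, 8);
    fprintf('Pool started with %d workers.\n', pool.NumWorkers);
    delete(pool);
catch err
    % Print the full report, including the "Caused by" chain.
    disp(getReport(err, 'extended'));
end
```

The printed host name and job storage location are useful details to include when contacting support.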

Answers (1)

Debadipto on 1 Aug 2023
Hi Florian,
Please refer to the following article:
If this doesn't solve the issue, then please reach out to MathWorks support for help.
Regards,
Debadipto Biswas
  2 Comments
Farhad on 6 Oct 2023
Hello,
I am also running Parallel Server on a cluster with SLURM as the scheduler.
I created a generic profile, as there is no shared storage between the users (clients) and the worker nodes. During validation everything runs fine except the last step, where I get the same error message as posted above.
Unfortunately, I can't access the link you posted.
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
Farhad on 6 Oct 2023
Update:
I also use matlab-proxy (https://github.com/mathworks/matlab-proxy) on the cluster.
When I first start a session through matlab-proxy and then use the Slurm profile, the pool starts successfully.
The output shows:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
Client starting to connect to workers
Connected to parallel pool with 2 workers.
But when I try the same from my Windows MATLAB client, it doesn't work.
I get almost the same output:
Got clientEndpoint to connect to worker 1 with URL: tcp://tcpnodelay=node0:27583/protocol/catapult
Got clientEndpoint to connect to worker 2 with URL: tcp://tcpnodelay=node0:27370/protocol/catapult
But then, after a while (the timeout duration), the connection fails:
Error using parallel.internal.pool.SpfClientSession
An unexpected error occurred accessing a parallel pool. The underlying error was: Timeout binding/connecting to specified endpoints.
As I am using a generic profile where I define AdditionalProperties ClusterHost, I put in the publicly available domain name of the login/head node of the cluster, but the workers themselves are not reachable from outside.
So I guess the binding/connecting fails because there is a private cluster network behind the login node, and the clientEndpoint is not proxied through to the MATLAB client machine (Windows desktop).
Is there any known issue about this? Or am I missing some configuration in the generic profile?
Thanks in advance
Best Regards
Farhad
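One way to check the guess about the private cluster network is to test, from the Windows client, whether the worker endpoint printed in the output accepts a plain TCP connection. The sketch below is only an assumption-laden illustration: the URL layout is inferred from the two log lines above, the ports appear to change with every pool start, and the check is only meaningful while the pool start is still waiting, so the URL has to be copied from the current attempt.

```matlab
% Endpoint URL copied from the pool startup output (changes on every attempt).
url = 'tcp://tcpnodelay=node0:27583/protocol/catapult';

% Extract host and port; the URL layout is inferred from the log lines only.
tok  = regexp(url, '^tcp://(?:tcpnodelay=)?([^:/]+):(\d+)', 'tokens', 'once');
host = tok{1};
port = str2double(tok{2});

% Try a plain TCP connection from this machine.
try
    t = tcpclient(host, port, 'ConnectTimeout', 5);
    clear t   % clearing the object closes the connection
    fprintf('%s:%d is reachable from this machine.\n', host, port);
catch err
    fprintf('%s:%d is NOT reachable: %s\n', host, port, err.message);
end
```

If the connection fails from the Windows desktop but succeeds from a session on the cluster, that would be consistent with the workers being reachable only inside the cluster network.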

Release

R2023a