Strange core usage when running Slurm jobs

I'm trying to run jobs on an HPC cluster using Slurm, but I run into problems both when I'm running interactive jobs and when I'm submitting batch jobs.
  1. When I run an interactive job and book one node, I manage to use all 20 of the node's cores. But when I book more than one node for an interactive job, the cores on the extra nodes are left unused.
  2. When I run a batch job, the job uses only one core per node.
Do you have any idea what I might be doing wrong?
1. I book my interactive job from the command prompt using the following commands:
interactive -A myAccountName -p devel -n 40 -t 0:30:00
module load matlab/R2023a
matlab
to submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName"), load the Matlab module and launch Matlab as an X application. Once in Matlab, I first choose the "Processes" parallel profile and second run the "Setup" and "Interactive" sections in the silly little script at the bottom of this question. In two separate terminal sessions, I then use
ssh MYNODEID
htop
where MYNODEID is either of the two nodes assigned to the interactive job. Then I see that the job uses all of the cores on one of the nodes and none of the cores on the second node.
2. To book my batch job, I load and launch Matlab from the command prompt using the following commands:
module load matlab/R2023a
matlab
and then run the "Setup" and "Batch" sections in the silly little script at the bottom of this question. Using the same procedure as above, htop lets me see that the job uses two cores (one on each node) and leaves the remaining 38 cores (19 on each node) unused.
Silly little script
%% Setup
clear;
close all;
clc;
N = 1000; % Length of fmincon vector
%% Interactive
x = solveMe(randn(1, N));
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
Cluster.batch( ...
@solveMe, ...
0, ...
{}, ...
'pool', 39 ...
); % Submit a 30-minute 40-core job to the "devel" partition using my account (not actually called "myAccountName")
%% Helper functions
function A = slowDown()
A = randn(5e3);
A = A + randn(5e3);
end
function x = solveMe(x0)
opts = optimoptions( ...
"fmincon", ...
"MaxFunctionEvaluations", 1e6, ...
"UseParallel", true ...
);
x = fmincon( ...
@(x) 0, ...
x0, ...
[], [], ...
[], [], ...
[], [], ...
@(x) nonlinearConstraints(x), ...
opts ...
);
function [c, ceq] = nonlinearConstraints(x)
c = [];
A = slowDown();
ceq = 1 ./ (1:numel(x)) - cumsum(x);
end
end

Accepted Answer

Damian Pietrus on 19 Mar 2024
Based on your code, it looks like you have correctly configured a cluster profile to submit a job to MATLAB Parallel Server. In this case, your MATLAB client will always submit a secondary job to the scheduler. It is in this secondary job that you should request the bulk of your resources. As an example, on the cluster login node you should only ask for a few cores (enough to run your MATLAB serial code), as well as a longer WallTime:
# Two cores, 1 hour WallTime
interactive -A myAccountName -p devel -n 2 -t 1:00:00
module load matlab/R2023a
matlab
Next, you should continue to use the AdditionalProperties fields to shape your "inner" job:
%% Batch
Cluster = parcluster('rackham R2023a');
Cluster.AdditionalProperties.AccountName = 'myAccountName';
Cluster.AdditionalProperties.QueueName = 'devel';
Cluster.AdditionalProperties.WallTime = '0:30:00';
When you call the MATLAB batch command, this is where you can then request the total amount of cores that you would like your parallel code to run on:
myJob40 = Cluster.batch(@solveMe, 0, {},'pool', 39);
myJob100 = Cluster.batch(@solveMe, 0, {},'pool', 99);
Notice that since this submits a completely separate job to the scheduler queue, you can choose a pool size larger than you requested in your 'interactive' CLI command. Also notice that the Cluster.AdditionalProperties WallTime value is shorter than the 'interactive' value. This is to account for the time that the inner job may wait in the queue.
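As a rough sketch of the arithmetic behind those pool sizes: batch runs the function on one worker in addition to the pool, so 'pool', 39 corresponds to 40 tasks in the inner job. The helper names below are hypothetical, the 20 cores per node comes from the question, and one core per worker is an assumption.

```shell
#!/bin/sh
# Cores per node: 20 on the asker's cluster (from the question); adjust for yours.
CORES_PER_NODE=20

# tasks_for_pool POOL_SIZE
# Total workers for batch(..., 'pool', POOL_SIZE): the pool plus the one
# worker that runs the batch function itself.
tasks_for_pool() {
    echo $(( $1 + 1 ))
}

# nodes_for_tasks TASKS
# Minimum number of nodes needed, assuming one core per task (ceiling division).
nodes_for_tasks() {
    echo $(( ($1 + CORES_PER_NODE - 1) / CORES_PER_NODE ))
}

for pool in 39 99; do
    tasks=$(tasks_for_pool "$pool")
    echo "pool=$pool -> tasks=$tasks, at least $(nodes_for_tasks "$tasks") node(s)"
done
```

Under these assumptions the two calls above would need at least 2 and 5 nodes respectively, regardless of what the outer 'interactive' session requested.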
Long story short -- when you call batch or parpool within a MATLAB session that has a Parallel Server cluster profile set up, it will submit a secondary job to the scheduler that can have its own separate resources. You can verify this by manually viewing the scheduler's job queue.
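One way to look at the queue from a terminal on the cluster is sketched below. The function name show_my_jobs is hypothetical; the squeue format fields are standard Slurm ones, but the column widths are arbitrary choices.

```shell
#!/bin/sh
# show_my_jobs: list the current user's Slurm jobs with node and CPU counts.
show_my_jobs() {
    if ! command -v squeue >/dev/null 2>&1; then
        # Not on a Slurm machine; nothing to show.
        echo "squeue not found; run this on a cluster login node" >&2
        return 0
    fi
    # %i job ID, %j job name, %T state, %D node count, %C CPU count,
    # %R node list (running) or reason (pending)
    squeue -u "${USER:-$(id -un)}" -o "%.12i %.20j %.10T %.6D %.6C %R"
}

show_my_jobs
```

If the secondary job was submitted, it should show up as its own row next to the interactive session, with its own node and CPU counts.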
Please let me know if you have any further questions!
  4 Comments
Fredrik P on 21 Mar 2024
Alright. I didn't know that "Processes" could only handle a single machine. Good to know.
I'll send you a private message as well.
Here are the two files that you requested.
communicatingSubmitFcn.m
function communicatingSubmitFcn(cluster, job, environmentProperties)
%COMMUNICATINGSUBMITFCN Submit a communicating MATLAB job to a Slurm cluster
%
% Set your cluster's IntegrationScriptsLocation to the parent folder of this
% function to run it when you submit a communicating job.
%
% See also parallel.cluster.generic.communicatingDecodeFcn.
% Copyright 2010-2018 The MathWorks, Inc.
% Get the MATLAB version being used
if verLessThan('matlab', '9.6')
before19A = 'true';
else
before19A = 'false';
end
% Store the current filename for the errors, warnings and dctSchedulerMessages
currFilename = mfilename;
if ~isa(cluster, 'parallel.Cluster')
error('parallelexamples:GenericSLURM:NotClusterObject', ...
'The function %s is for use with clusters created using the parcluster command.', currFilename)
end
decodeFunction = 'parallel.cluster.generic.communicatingDecodeFcn';
if ~cluster.HasSharedFilesystem
error('parallelexamples:GenericSLURM:NotSharedFileSystem', ...
'The function %s is for use with shared filesystems.', currFilename)
end
if ~strcmpi(cluster.OperatingSystem, 'unix')
error('parallelexamples:GenericSLURM:UnsupportedOS', ...
'The function %s only supports clusters with unix OS.', currFilename)
end
enableDebug = 'false';
if isprop(cluster.AdditionalProperties, 'EnableDebug') ...
&& islogical(cluster.AdditionalProperties.EnableDebug) ...
&& cluster.AdditionalProperties.EnableDebug
enableDebug = 'true';
end
% The job specific environment variables
% Remove leading and trailing whitespace from the MATLAB arguments
matlabArguments = strtrim(environmentProperties.MatlabArguments);
variables = {'MDCE_DECODE_FUNCTION', decodeFunction; ...
'MDCE_STORAGE_CONSTRUCTOR', environmentProperties.StorageConstructor; ...
'MDCE_JOB_LOCATION', environmentProperties.JobLocation; ...
'MDCE_MATLAB_EXE', environmentProperties.MatlabExecutable; ...
'MDCE_MATLAB_ARGS', matlabArguments; ...
'PARALLEL_SERVER_DEBUG', enableDebug; ...
'MDCE_BEFORE19A', before19A; ...
'MLM_WEB_LICENSE', environmentProperties.UseMathworksHostedLicensing; ...
'MLM_WEB_USER_CRED', environmentProperties.UserToken; ...
'MLM_WEB_ID', environmentProperties.LicenseWebID; ...
'MDCE_LICENSE_NUMBER', environmentProperties.LicenseNumber; ...
'MDCE_STORAGE_LOCATION', environmentProperties.StorageLocation; ...
'MDCE_CMR', cluster.ClusterMatlabRoot; ...
'MDCE_TOTAL_TASKS', num2str(environmentProperties.NumberOfTasks); ...
'MDCE_NUM_THREADS', num2str(cluster.NumThreads)};
% Set each environment variable to newValue if currentValue differs.
% We must do this particularly when newValue is an empty value,
% to be sure that we clear out old values from the environment.
for ii = 1:size(variables, 1)
variableName = variables{ii,1};
currentValue = getenv(variableName);
newValue = variables{ii,2};
if ~strcmp(currentValue, newValue)
setenv(variableName, newValue);
end
end
% Deduce the correct quote to use based on the OS of the current machine
if ispc
quote = '"';
else
quote = '''';
end
% Specify the job wrapper script to use.
if isprop(cluster.AdditionalProperties, 'UseSmpd') && cluster.AdditionalProperties.UseSmpd
scriptName = 'communicatingJobWrapperSmpd.sh';
else
scriptName = 'communicatingJobWrapper.sh';
end
% The wrapper script is in the same directory as this file
dirpart = fileparts(mfilename('fullpath'));
quotedScriptName = sprintf('%s%s%s', quote, fullfile(dirpart, scriptName), quote);
% Choose a file for the output. Please note that currently, JobStorageLocation refers
% to a directory on disk, but this may change in the future.
logFile = cluster.getLogLocation(job);
quotedLogFile = sprintf('%s%s%s', quote, logFile, quote);
jobName = sprintf('Job%d', job.ID);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% CUSTOMIZATION MAY BE REQUIRED %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% You might want to customize this section to match your cluster,
% for example to limit the number of nodes for a single job.
additionalSubmitArgs = sprintf('--ntasks=%d --cpus-per-task=%d', environmentProperties.NumberOfTasks, cluster.NumThreads);
commonSubmitArgs = getCommonSubmitArgs(cluster, environmentProperties.NumberOfTasks);
if ~isempty(commonSubmitArgs) && ischar(commonSubmitArgs)
additionalSubmitArgs = strtrim([additionalSubmitArgs, ' ', commonSubmitArgs]) %#ok<NOPRT>
end
dctSchedulerMessage(5, '%s: Generating command for task %i', currFilename, ii);
commandToRun = getSubmitString(jobName, quotedLogFile, quotedScriptName, ...
additionalSubmitArgs);
% Now ask the cluster to run the submission command
dctSchedulerMessage(4, '%s: Submitting job using command:\n\t%s', currFilename, commandToRun);
try
% Make the shelled out call to run the command.
[cmdFailed, cmdOut] = system(commandToRun);
catch err
cmdFailed = true;
cmdOut = err.message;
end
if cmdFailed
error('parallelexamples:GenericSLURM:SubmissionFailed', ...
'Submit failed with the following message:\n%s', cmdOut);
end
dctSchedulerMessage(1, '%s: Job output will be written to: %s\nSubmission output: %s\n', currFilename, logFile, cmdOut);
jobIDs = extractJobId(cmdOut);
% jobIDs must be a cell array
if isempty(jobIDs)
warning('parallelexamples:GenericSLURM:FailedToParseSubmissionOutput', ...
'Failed to parse the job identifier from the submission output: "%s"', ...
cmdOut);
end
if ~iscell(jobIDs)
jobIDs = {jobIDs};
end
% set the job ID on the job cluster data
cluster.setJobClusterData(job, struct('ClusterJobIDs', {jobIDs}));
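For reference, the resource flags that the customization section above builds can be reproduced outside MATLAB. This is only a sketch of the sprintf in communicatingSubmitFcn.m: the function name build_submit_args is hypothetical, and whatever getCommonSubmitArgs and getSubmitString add (neither is shown in this thread) is left out.

```shell
#!/bin/sh
# build_submit_args NUM_TASKS NUM_THREADS
# Mirrors the format string '--ntasks=%d --cpus-per-task=%d' from
# communicatingSubmitFcn.m.
build_submit_args() {
    printf '%s=%d %s=%d\n' --ntasks "$1" --cpus-per-task "$2"
}

# A pool of 39 plus the worker running the batch function gives 40 tasks.
# One thread per worker (cluster.NumThreads = 1) is an assumption.
build_submit_args 40 1
```

Note that these two flags only fix the task count and the CPUs per task; they say nothing about the number of nodes or how tasks are bound to cores.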
communicatingJobWrapper.sh
#!/bin/sh
# This wrapper script is intended to be submitted to Slurm to support
# communicating jobs.
#
# This script uses the following environment variables set by the submit MATLAB code:
# MDCE_CMR - the value of ClusterMatlabRoot (may be empty)
# MDCE_MATLAB_EXE - the MATLAB executable to use
# MDCE_MATLAB_ARGS - the MATLAB args to use
# PARALLEL_SERVER_DEBUG - used to debug problems on the cluster
# MDCE_BEFORE19A - the MATLAB version number being used
#
# The following environment variables are forwarded through mpiexec:
# MDCE_DECODE_FUNCTION - the decode function to use
# MDCE_STORAGE_LOCATION - used by decode function
# MDCE_STORAGE_CONSTRUCTOR - used by decode function
# MDCE_JOB_LOCATION - used by decode function
#
# The following environment variables are set by Slurm:
# SLURM_NODELIST - list of hostnames allocated to this Slurm job
# Copyright 2015-2018 The MathWorks, Inc.
# Echo the nodes that the scheduler has allocated to this job:
echo The scheduler has allocated the following nodes to this job: ${SLURM_NODELIST:?"Node list undefined"}
if [ "${MDCE_BEFORE19A}" = "true" ]; then
module load intelmpi/17.2
FULL_MPIEXEC=mpiexec.hydra
# Override default bootstrap
# Options are: ssh, rsh, slurm, lsf, and sge
export I_MPI_HYDRA_BOOTSTRAP=slurm
# Ensure that mpiexec is not using the Slurm PMI library
# I_MPI_PMI_LIBRARY must not be defined
unset I_MPI_PMI_LIBRARY
else
# Create full path to mw_mpiexec if needed.
FULL_MPIEXEC=${MDCE_CMR:+${MDCE_CMR}/bin/}mw_mpiexec
fi
export TZ="Europe/Stockholm"
# Label stdout/stderr with the rank of the process
MPI_VERBOSE=-l
# Increase the verbosity of mpiexec if PARALLEL_SERVER_DEBUG or MDCE_DEBUG (for backwards compatibility) is true
if [ "X${PARALLEL_SERVER_DEBUG}X" = "XtrueX" ] || [ "X${MDCE_DEBUG}X" = "XtrueX" ]; then
MPI_VERBOSE="${MPI_VERBOSE} -v -print-all-exitcodes"
fi
# Construct the command to run.
CMD="\"${FULL_MPIEXEC}\" ${MPI_VERBOSE} -n ${MDCE_TOTAL_TASKS} \"${MDCE_MATLAB_EXE}\" ${MDCE_MATLAB_ARGS}"
# Echo the command so that it is shown in the output log.
echo $CMD
# Execute the command.
eval $CMD
MPIEXEC_EXIT_CODE=${?}
if [ ${MPIEXEC_EXIT_CODE} -eq 42 ] ; then
# Get here if user code errored out within MATLAB. Overwrite this to zero in
# this case.
echo "Overwriting MPIEXEC exit code from 42 to zero (42 indicates a user-code failure)"
MPIEXEC_EXIT_CODE=0
fi
echo "Exiting with code: ${MPIEXEC_EXIT_CODE}"
exit ${MPIEXEC_EXIT_CODE}
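Two idioms in the wrapper are easy to misread, so here is a small standalone sketch of both: the ${var:+word} expansion used to build FULL_MPIEXEC, and the remapping of exit code 42. The function names and the /opt/matlab path are hypothetical and used for illustration only.

```shell
#!/bin/sh
# resolve_mpiexec CLUSTER_MATLAB_ROOT
# ${root:+word} expands to word only when root is set and non-empty, so an
# empty ClusterMatlabRoot falls back to a bare mw_mpiexec found via PATH.
resolve_mpiexec() {
    root=$1
    echo "${root:+${root}/bin/}mw_mpiexec"
}

# remap_exit_code CODE
# Report 42 (user code errored inside MATLAB) as 0 to the scheduler and
# pass every other code through unchanged.
remap_exit_code() {
    if [ "$1" -eq 42 ]; then
        echo 0
    else
        echo "$1"
    fi
}

resolve_mpiexec ""              # prints: mw_mpiexec
resolve_mpiexec "/opt/matlab"   # prints: /opt/matlab/bin/mw_mpiexec
remap_exit_code 42              # prints: 0
remap_exit_code 1               # prints: 1
```

One consequence of the remapping is that Slurm may report the job as successful even though the user code failed, so the MATLAB job object and the log file are the places to look for such errors.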
Damian Pietrus on 21 Mar 2024
Thanks for including that -- It looks like your integration scripts are from around 2018. Since they are a bit out of date, they don't include some changes that will hopefully fix the core binding issue you're experiencing. I'll reach out to you directly, but for anyone else that finds this post in the future, you can get an updated set of integration scripts here:

Release

R2023a
