Parallel Computing Maximize CPU Usage

5 views (last 30 days)
Jasper Mark on 14 May 2021
Edited: Edric Ellis on 17 May 2021
Hi,
I am relatively new to working with parallel code and am using a supercomputer institute to bear the brunt of the CPU and RAM load. I am running the code below and am having a difficult time maximizing my CPU efficiency. After running with 36 cores, the CPU efficiency was 30%. I then dropped it to 10 cores (~30% of 36) and only saw a CPU efficiency of 70%.
I would love to get this script to 100% at 36 cores, as that would greatly speed up this processing, but I am at a loss as to why it isn't automatically maximizing my CPU efficiency.
Cheers,
JM
function cohSquare = cohAllReg(regions, fd)
%Create empty cell array for cohSquare (extra row/column for labels)
cohSquare = cell(size(regions, 2)+1, size(regions, 2)+1);
%Create x and y labels corresponding to ROIs
yLabel = horzcat(' ', regions(1, :));
xLabel = horzcat(' ', regions(1, :))';
%Define length of index
range = size(regions, 2)+1;
%Create coherence value upper triangle matrix for left-sided stroke
if fd.lesion_side == 'l'
    parfor i = 2:range
        for j = 2:range
            if i <= j
                cohSquare{i, j} = getCoh(regions{2, i-1}, regions{2, j-1}, fd);
            end
        end
    end
%Create coherence value upper triangle matrix for right-sided stroke
elseif fd.lesion_side == 'r'
    parfor i = 2:range
        for j = 2:range
            if i <= j
                cohSquare{i, j} = getCoh(regions{5, i-1}, regions{5, j-1}, fd);
            end
        end
    end
end
cohSquare(1, :) = xLabel;
cohSquare(:, 1) = yLabel;
end
delete(gcp('nocreate'));

Edric Ellis on 17 May 2021
Edited: Edric Ellis on 17 May 2021
There are several reasons why parfor may not be able to do a perfect job of speeding up your calculation. These basically boil down to two broad categories:
1. Overheads associated with running in parallel (dividing up the work, sending stuff to and from the workers, imperfect scheduling of the work - i.e. some workers left idle towards the end of the loop).
2. Intrinsic hardware limitations - not every single-threaded program can be perfectly accelerated on a given machine. One common cause here is access to memory. For instance you might run out of cache memory, or main memory bandwidth. This can be particularly challenging to diagnose. One way is to use hardware performance counters. A simpler way is to run the core piece of your computation with increasing contention on your target system, and see whether the performance degrades (it often does).
You can investigate the overheads by using ticBytes/tocBytes. However, that's not always the whole picture - things depend on how "far" away the workers are. It's not clear if you're running on a single system or not.
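As a minimal sketch of that first check, you can wrap one run of the loop in ticBytes/tocBytes (both in Parallel Computing Toolbox) to see how much data each worker sends and receives. The call to cohAllReg here assumes your regions and fd variables are already in scope:

```matlab
% Sketch: measure data transfer to/from the workers for one parfor run.
% Assumes a parallel pool is already open and cohAllReg's inputs exist.
p = gcp;                            % handle to the current pool
ticBytes(p);                        % start counting bytes per worker
cohSquare = cohAllReg(regions, fd); % the loop under investigation
tocBytes(p)                         % prints BytesSentToWorkers and
                                    % BytesReceivedFromWorkers per worker
```

If those numbers are large relative to the work done per iteration, transfer overhead is a likely culprit; if they are small, look to the hardware-contention side instead.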
Digging into the second point can be done by timing execution on the workers as you increase the number of concurrent computations. One way is to use spmd, something a bit like this:
parpool(36);
for ii = 1:36
    spmd (ii) % Limit the number of workers for the SPMD block
        t = tic();
        % Run an inner loop to ensure the timings are not dominated by spmd
        % overheads. Aim to get the block to take 5-10 seconds with a
        % single worker
        for jj = 1:N
            getCoh(. . .);
        end
        t = toc(t);
    end
    t = t{1}; % retrieve time from worker
    % Display (or capture) the time as contention increases.
    [ii, t]
end