Pcg and Parallel Computing Toolbox

Hi MATLAB community, I know that the function pcg is supported in Parallel Computing Toolbox for data-parallel computations with distributed arrays. I am using an HPC architecture made of 8 nodes; each blade consists of 2 quad-core processors sharing memory, for a total of 8 cores per blade and 64 cores in total. I ran pcg on 1 core and pcg with distributed arrays on 32 cores.
tic
[y]=pcg(A,b,[],100); %first case
toc
A=distributed(A);
b=distributed(b);
tic
[x,flagCG_1,iter] = pcg(@(x)gather(A*x),b,[],100); %second case on 32 cores
toc
I obtained:
Elapsed time is 0.001279 seconds. %first case
Elapsed time is 0.316632 seconds. %second case on 32 cores
Why is the time in the second case greater than in the first? What am I doing wrong? I tried larger matrices, but the second case is always slower. I am probably not using pcg correctly for distributed arrays. Thanks for your help.

Answers (2)

Hey Rosalba,
Using Parallel Computing Toolbox for very small problems, or with a very large number of threads, may not help, for two reasons:
  1. For very small operations, like A = B + C (B = 2x2, C = 2x2), the runtime has to set up the parallel work, run it on separate cores, and then join the threads. This overhead can exceed the cost of serial execution.
  2. With a very large number of threads, the OS is kept busy switching between threads, saving the state of each thread, and then loading the state of each thread again. This can cost a lot of time.
In your problem, the former seems to be the case.
As a workaround, you can either run the code without distributed arrays or try a parfor loop (code snippet given below):
parfor i = 1:iter
    x(i) = f(k, l); % f, k, and l are placeholders; x is filled one element per iteration
end
Note that the function f should not have any data dependency among the iterations.
For more information on parfor, see the parfor documentation.

You can simply do:
dA = distributed(A);
db = distributed(b);
[dx, flag, iter] = pcg(dA, db, [], 100); % dx is a distributed array
x = gather(dx);
But you should be aware that distributed arrays are not designed to be faster than in-memory arrays; they are designed to process arrays that would not fit in your local memory. In practice, operations on distributed arrays are usually slower because of the extra cost of communication, but if the matrix is large enough, these operations could not be performed at all otherwise! If you are interested in performance, you may try gpuArray -- pcg is supported for gpuArray.
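As a rough sketch of the gpuArray route mentioned above, the following solves a small system on the GPU when one is available. The test matrix, tolerance, and iteration limit are assumptions chosen for illustration, not the asker's data. It needs Parallel Computing Toolbox and a supported GPU, and for a system this small the transfer cost will likely dominate the run time.

```matlab
% Illustrative system (an assumption, not the asker's A and b):
% 2-D Poisson matrix, sparse and symmetric positive definite.
A = delsq(numgrid('S', 32));
b = ones(size(A, 1), 1);

if gpuDeviceCount > 0
    Ag = gpuArray(A);                       % copy the matrix to the GPU
    bg = gpuArray(b);                       % copy the right-hand side
    [xg, flag, relres, iter] = pcg(Ag, bg, 1e-6, 500);
    x = gather(xg);                         % bring the solution back to the host
    fprintf('flag %d after %d iterations\n', gather(flag), gather(iter));
else
    disp('No GPU detected; skipping the gpuArray run.');
end
```

Only the copies to and from the GPU differ from the in-memory call; the pcg arguments stay the same.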

4 Comments

So there is no way to improve performance using my resources (an HPC architecture with 64 cores in total)?
These architectures are primarily meant to solve problems that would not fit in the memory of a single node; they are not meant to solve tiny problems faster. Distributed arrays take advantage of these architectures so that instead of getting an "OutOfMemory" error you will be able to run the computation, but keep in mind that a computation at that scale is probably quite complex, so it is going to take some time.
Note that the performance of pcg can be drastically improved by using preconditioners such as ilu or ichol. Have you tried using these?
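A minimal sketch of the preconditioner suggestion, using ordinary in-memory arrays: it compares plain pcg with pcg preconditioned by a zero-fill incomplete Cholesky factor from ichol. The test matrix (a 2-D Poisson problem), the tolerance, and the iteration limit are assumptions for illustration; the asker's A may behave differently, and ichol requires a sparse symmetric positive definite matrix.

```matlab
% Illustrative sparse SPD system (an assumption, not the asker's data).
A = delsq(numgrid('S', 100));
b = ones(size(A, 1), 1);
tol = 1e-6;
maxit = 1000;

% Plain conjugate gradients, no preconditioner.
[x0, flag0, relres0, iter0] = pcg(A, b, tol, maxit);

% Incomplete Cholesky factor L, so that L*L' approximates A.
L = ichol(A);
[x1, flag1, relres1, iter1] = pcg(A, b, tol, maxit, L, L');

fprintf('no preconditioner: flag %d, %d iterations\n', flag0, iter0);
fprintf('ichol:             flag %d, %d iterations\n', flag1, iter1);
```

Comparing iter0 and iter1 shows how much work the preconditioner saves per solve, independent of any parallel hardware.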
I understand, thanks. But I need some parallel code for the pcg method, or a way to use the MATLAB pcg function in parallel. I will probably have to write the code myself; could you suggest some code to take inspiration from?
How large a problem are you trying to solve? If you're trying to solve small problems, using parallel code may not save you any time as you've seen because the overhead of setting up the problem in parallel may outweigh any savings you get from running in parallel.
Picture yourself and three young children going to the grocery store (before COVID-19, of course). If you split the four items on your shopping list among your group, are you really saving time? Maybe, but if you do save time, you probably don't save a lot. And depending on how mature the children are, you may very well lose time. How about if you split the 100 items on your list into four segments? Then you may save more time than in the four-item case.
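One way to see where the overhead stops dominating on a given cluster is to time both variants over growing problem sizes. The sketch below does that; the grid sizes, the Poisson test matrix, and the solver settings are assumptions for illustration. It assumes Parallel Computing Toolbox and an already open parallel pool.

```matlab
% Assumes a parallel pool is already open (parpool, or matlabpool in
% older releases such as R2013a).
gridSizes = [50 200 800];                  % illustrative grid sizes
tSerial = zeros(size(gridSizes));
tDist   = zeros(size(gridSizes));

for k = 1:numel(gridSizes)
    A = delsq(numgrid('S', gridSizes(k))); % sparse SPD test matrix
    b = ones(size(A, 1), 1);

    tic
    [x, flagS] = pcg(A, b, 1e-6, 100);     % in-memory solve on the client
    tSerial(k) = toc;

    dA = distributed(A);                   % distributing the data is not timed
    db = distributed(b);
    tic
    [dx, flagD] = pcg(dA, db, 1e-6, 100);  % solve on the pool workers
    tDist(k) = toc;

    fprintf('n = %7d: serial %.4f s, distributed %.4f s\n', ...
            size(A, 1), tSerial(k), tDist(k));
end
```

Both solves are capped at 100 iterations, so the timings compare the cost of the same amount of work rather than time to convergence.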

Release: R2013a
Asked: 7 Sep 2020
Answered: 16 Sep 2020
