GPU and CPU Parallelization and Bicg Optimization

Question

I use a MATLAB script to solve a large sparse linear system with the bicg function. In simplified form, my code looks like this:

for i=1:n
    ...
    [Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
    % AS is a 10^6x10^6 sparse complex double
    % BS is a 10^6x1 sparse complex double
    % L and U are 10^6x10^6 sparse complex doubles
    ...
end

Every loop iteration is independent, so I recently parallelized the script with parfor. The computer I use has 128 CPU cores, but with a pool larger than 32 workers (parpool with anything above 32) the run time no longer decreases significantly. Since I usually use n=32 (i.e., I run the loop for 32 different scenarios), this is not a big issue for me. The code currently looks something like this:

parpool(32)
parfor i=1:n
    ...
    [Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
    % AS is a 10^6x10^6 sparse complex double
    % BS is a 10^6x1 sparse complex double
    % L and U are 10^6x10^6 sparse complex doubles
    ...
end

I want to speed up the code further using gpuArray, which bicg supports. The main reason is that I also use another script in which I call bicg sequentially many times. In that case n is 1, but the many repeated solves make it computationally expensive. If possible, I also want to use gpuArrays for cases where n is 32 or more (i.e., the code described above).
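For the sequential (n = 1) case, here is a minimal sketch of what a single GPU solve might look like. The small tridiagonal matrix is a hypothetical stand-in for the real 10^6x10^6 system, and the preconditioners L and U are left out on purpose: the GPU notes on the bicg documentation page list restrictions for sparse inputs with preconditioners, so whether L and U can be passed unchanged is something to confirm there. The sketch also assumes canUseGPU is available (R2019b or later) and falls back to the CPU otherwise.

```matlab
% Hypothetical stand-in problem; the real AS and BS are 10^6-sized.
nDemo = 1000;
e     = ones(nDemo, 1);
AS    = spdiags([-e, (4 + 1i)*e, -e], -1:1, nDemo, nDemo); % sparse complex double
BS    = sparse(AS * e);                                     % sparse RHS, exact solution is e
tol   = 1e-8;
maxit = 200;

if canUseGPU()
    AG = gpuArray(AS);                 % sparse matrix stored in GPU memory
    BG = gpuArray(full(BS));           % right-hand side as a dense GPU vector
    [PG, flag] = bicg(AG, BG, tol, maxit);
    Pvect = gather(PG);                % copy the solution back to host memory
    flag  = gather(flag);
else
    % No usable GPU: same call on the CPU
    [Pvect, flag] = bicg(AS, full(BS), tol, maxit);
end
```

When bicg is called many times in a row, keeping the data on the GPU between calls and gathering only the final result should avoid repeated host-to-device transfers.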

I checked the documentation and other user questions, but I am a little lost on how to use CPU and GPU power concurrently. The computer I use has 3 GPUs that I can utilize.

- Should I try to use only the GPUs for both the parfor loop and solution of bicg?

- Or should I run the parfor loop on the CPU cores and use all the GPUs for the bicg solves? If so, how can I do this? As far as I understand, the GPU resources will be distributed among the workers in this case.

- Or what would be your suggestion on doing this properly? Thank you very much for any kind of guidance in advance!
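To make the second option concrete, this is a rough sketch of the pattern I have in mind: a pool with one worker per GPU, where each parfor iteration moves its own system to the GPU and solves it there. It assumes that workers in a local pool are each assigned a different GPU by default, which is my reading of the Parallel Computing Toolbox documentation and should be verified. The test matrices are hypothetical stand-ins, and L and U are again omitted.

```matlab
nGPU = gpuDeviceCount;                 % number of GPUs visible to MATLAB
if nGPU > 0 && isempty(gcp('nocreate'))
    parpool(nGPU);                     % one worker per GPU (3 on this machine)
end

nDemo  = 1000;                         % stand-in size; the real systems are 10^6
e      = ones(nDemo, 1);
tol    = 1e-8;
maxit  = 200;
n      = 6;                            % number of independent scenarios
relres = zeros(n, 1);

parfor k = 1:n
    % Hypothetical system that differs in every iteration
    AS = spdiags([-e, (4 + 1i*k)*e, -e], -1:1, nDemo, nDemo);
    BS = AS * e;
    if canUseGPU()
        % Solve on the GPU assigned to this worker
        [~, ~, rr] = bicg(gpuArray(AS), gpuArray(BS), tol, maxit);
        relres(k) = gather(rr);
    else
        % Worker without a GPU: solve on the CPU
        [~, ~, rr] = bicg(AS, BS, tol, maxit);
        relres(k) = rr;
    end
end
```

With this layout only nGPU solves run at the same time, so it is not obvious that it would beat 32 CPU workers; running several workers per GPU is possible but they would share the 40 GB of device memory.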

The computer that I use is the following (I can also try to use 2 of these computers/nodes in the future. Do you think that would help with any of the scenarios described above?):

GPU: 3x NVIDIA A100 PCIE 40GB (1 per socket); gpu0: socket 0, gpu1: socket 1, gpu2: socket 1
GPU memory: 40 GB HBM2
CPU: 2x AMD EPYC 7763 64-Core Processor ("Milan")
Total cores per node: 128 cores on two sockets (64 cores/socket)
Hardware threads per core: 1
Hardware threads per node: 128 x 1 = 128
Clock rate: 2.45 GHz
RAM: 256 GB
Cache: 32 KB L1 data cache per core; 512 KB L2 per core; 32 MB L3 per core complex (1 core complex contains 8 cores); 256 MB L3 total (8 core complexes); each socket can cache up to 288 MB (sum of L2 and L3 capacity)
Local storage: 144 GB /tmp partition on a 288 GB SSD