# GPU and CPU Parallelization and Bicg Optimization

Zulkuf Azizoglu on 11 Oct 2022
Commented: Zulkuf Azizoglu on 26 Jun 2023
I use a MATLAB script to solve a large sparse linear system with the bicg function. Simplified, my code looks like this:
```matlab
for i=1:n
...
[Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
% AS is a 10^6x10^6 sparse complex double
% BS is a 10^6x1 sparse complex double
% L and U are 10^6x10^6 sparse complex doubles
...
end
```
Every loop iteration is independent. I recently parallelized this script using parfor. The computer I use has 128 CPU cores, but I noticed that with a pool of more than 32 workers the run time no longer decreases significantly. However, I usually use n=32 (i.e., I run the loop for 32 different scenarios), so this is not a big issue for me. The code currently looks like this:
```matlab
parpool(32)
parfor i=1:n
...
[Pvect] = bicg(AS, BS, tol, maxit, L, U); % AS, BS, L, and U are different in each iteration
% AS is a 10^6x10^6 sparse complex double
% BS is a 10^6x1 sparse complex double
% L and U are 10^6x10^6 sparse complex doubles
...
end
```
I want to speed up the code further using gpuArray (which bicg supports). The main reason is that I also use another script in which I call bicg sequentially many times. In that case n is 1, but the many repeated solves make it computationally expensive. If possible, I also want to use gpuArrays for cases where n is 32 or more (i.e., the code described above).
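For the sequential case, a minimal sketch of moving a single solve to the GPU might look like the following. The small tridiagonal system is a hypothetical stand-in for the real AS, BS, L, and U (whose construction is elided above), and converting BS to full before the transfer is an assumption made for safety, not a documented requirement. Whether the GPU path accepts complex sparse inputs with two preconditioners may depend on the MATLAB release, so the bicg documentation should be checked before relying on it.

```matlab
% Hypothetical stand-in problem; replace with the real AS, BS, L, U.
m  = 1000;                                   % illustrative size, not 10^6
e  = ones(m, 1);
AS = spdiags([-e, 4*e, -e], -1:1, m, m) + 1i*speye(m);  % sparse complex double
BS = AS * ones(m, 1);                        % right-hand side with known solution (all ones)
[L, U] = ilu(AS);                            % sparse triangular preconditioners
tol = 1e-8;
maxit = 200;

% Transfer the operands only when a supported GPU is present,
% otherwise the same call runs on the CPU.
if canUseGPU()
    AS = gpuArray(AS);
    BS = gpuArray(full(BS));                 % assumption: pass the vector as full
    L  = gpuArray(L);
    U  = gpuArray(U);
end

[Pvect, flag, relres] = bicg(AS, BS, tol, maxit, L, U);

% Bring the result back to host memory (no-op for a CPU array).
Pvect = gather(Pvect);
```

When bicg is called many times in sequence, keeping the data on the GPU between calls and gathering only the final result avoids repeated host-to-device transfers, which can otherwise dominate the run time.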
I checked the documentation and other user questions, but I am a little lost on how to use CPU and GPU power concurrently. The computer I use has 3 GPUs that I can utilize.
- Should I use only the GPUs for both the parfor loop and the bicg solve?
- Or should I run the parfor loop on the CPU workers and use all the GPUs for the bicg solve? If so, how can I do this? As far as I understand, the GPU resources will be shared among the workers in this case.
- Or what would you suggest as the proper way to do this? Thank you very much in advance for any guidance!
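One way the second option could be sketched is a pool with one worker per GPU, so that each parfor iteration solves its system on that worker's device. This is only an illustration under stated assumptions: the scenario count and the small test matrices are hypothetical stand-ins, and it relies on the documented default behaviour that each worker in a pool no larger than the number of GPUs is assigned its own GPU. With more workers than GPUs, several workers would share a device and its memory.

```matlab
nScenarios = 6;                       % hypothetical number of independent scenarios
numGPUs    = gpuDeviceCount;          % 3 on the machine described in this question
useGPU     = numGPUs > 0;

% Open a pool with one worker per GPU (or a single worker if no GPU is found).
pool = gcp("nocreate");
if isempty(pool)
    pool = parpool(max(numGPUs, 1));
end

results = cell(nScenarios, 1);
parfor i = 1:nScenarios
    % Hypothetical stand-in for the per-scenario setup of AS, BS, L, U.
    m  = 500;
    e  = ones(m, 1);
    AS = spdiags([-e, (4 + i)*e, -e], -1:1, m, m) + 1i*speye(m);
    BS = AS * ones(m, 1);             % known solution: all ones
    [L, U] = ilu(AS);

    if useGPU
        % Each worker transfers its own operands to its assigned GPU.
        AS = gpuArray(AS);
        BS = gpuArray(full(BS));      % assumption: pass the vector as full
        L  = gpuArray(L);
        U  = gpuArray(U);
    end

    [x, flag] = bicg(AS, BS, 1e-8, 200, L, U);
    results{i} = gather(x);           % return the solution in host memory
end
```

With only three GPU workers, this trades the 32-way CPU parallelism for faster individual solves, so it is worth timing both variants on a representative scenario before committing to one.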
The computer I use has the specifications below. I may also be able to use 2 of these computers (nodes) in the future; do you think that would help with any of the scenarios described above?
- GPU: 3x NVIDIA A100 PCIE 40GB (1 per socket)
  - gpu0: socket 0
  - gpu1: socket 1
  - gpu2: socket 1
- GPU Memory: 40 GB HBM2
- CPU: 2x AMD EPYC 7763 64-Core Processor ("Milan")
- Total cores per node: 128 cores on two sockets (64 cores / socket)
- Hardware threads per core: 1 per core
- Hardware threads per node: 128 x 1 = 128
- Clock rate: 2.45 GHz
- RAM: 256 GB
- Cache:
  - 32 KB L1 data cache per core
  - 512 KB L2 per core
  - 32 MB L3 per core complex (1 core complex contains 8 cores)
  - 256 MB L3 total (8 core complexes)
  - Each socket can cache up to 288 MB (sum of L2 and L3 capacity)
- Local storage: 144 GB /tmp partition on a 288 GB SSD.

Alvaro on 26 Jan 2023
Zulkuf Azizoglu on 26 Jun 2023
Thank you!