Summing array elements seems to be slow on GPU
I am testing the times of execution for the following function on CPU and GPU
function funTestGPU(P,U,K,UN)
    for k = 1:P
        H = exp(1i*K);
        HU = U.*H;
        UN(k,:) = sum(HU,[1,3]);
    end
end
where U, UN are complex arrays of size P×P and K is a complex array of size P×P×P. So in each iteration I perform an element-wise exp(), an element-wise multiplication of two arrays, and a summation of the elements of a 3D array along two dimensions.
I test the execution time on CPU and on GPU with the help of the following script
P = 200;
URe = 1/(sqrt(2))*rand(P);
UIm = 1/(sqrt(2))*rand(P);
KRe = 1/(sqrt(2))*rand(P,P,P);
KIm = 1/(sqrt(2))*rand(P,P,P);
% CPU
U = complex(URe, UIm);
K = complex(KRe, KIm);
UN = complex(zeros(P), zeros(P));
fcpu = @() funTestGPU(P,U,K,UN);
tcpu = timeit(fcpu);
disp(['CPU time: ',num2str(tcpu)])
% GPU
U = gpuArray(complex(URe, UIm));
K = gpuArray(complex(KRe, KIm));
UN = gpuArray(complex(zeros(P), zeros(P)));
fgpu = @() funTestGPU(P,U,K,UN);
tgpu = gputimeit(fgpu);
disp(['GPU time: ',num2str(tgpu)])
and I obtain the results
CPU time: 9.0315
GPU time: 3.3894
My concern is that if I remove the last operation (summing the array elements) from funTestGPU, I obtain the following results:
CPU time: 8.0185
GPU time: 0.0045631
So it looks like the summation is the most time-consuming operation on GPU. Is that an expected result?
I wrote analogous code in CuPy and in PyTorch, and there the summation does not seem to be the most time-consuming operation.
I use MATLAB R2019b. My graphics card is an NVIDIA GeForce GTX 1050 Ti (768 CUDA cores), and my processor is an AMD Ryzen 7 3700X (8 physical cores).
Accepted Answer
More Answers (1)
Joss Knight
on 27 Apr 2023
Moved: Matt J
on 27 Apr 2023
1 vote
Why are you recomputing H and HU inside the loop? They do not change. If you remove the sum, the results from the first (P-1) iterations are never used, so only the last computation of those values will actually take place.
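For the function exactly as posted, where nothing inside the loop depends on k, the loop-invariant work could be hoisted out of the loop. The following is only a sketch under that assumption (the real code presumably has some dependence on k); the names UNloop, UNhoisted, rowSum and maxDiff are illustrative and not from the original post. It uses small CPU arrays so the comparison runs quickly, and wrapping U and K in gpuArray should run the same code on the GPU.

```matlab
% Sketch: hoist loop-invariant work out of the loop (assumes no k-dependence).
P = 20;                                   % small size so the check is fast
U = complex(rand(P), rand(P));            % P-by-P complex
K = complex(rand(P,P,P), rand(P,P,P));    % P-by-P-by-P complex

% Structure of the posted function: H and HU are recomputed P times.
UNloop = complex(zeros(P), zeros(P));
for k = 1:P
    H = exp(1i*K);
    HU = U.*H;                            % implicit expansion of U along dim 3
    UNloop(k,:) = sum(HU,[1,3]);          % 1-by-P row
end

% Hoisted version: compute the row once, then replicate it into P rows.
H = exp(1i*K);
rowSum = sum(U.*H,[1,3]);                 % 1-by-P
UNhoisted = repmat(rowSum, P, 1);         % P-by-P

% Largest absolute difference between the two results.
maxDiff = max(abs(UNloop(:) - UNhoisted(:)));
```

Both versions fill every row of the output with the same 1-by-P sum, so the hoisted version does the expensive exp, multiply and sum once instead of P times.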
6 Comments
Matt J
on 27 Apr 2023
Very strange. I wonder if it is wise to have this "optimization". Essentially, it causes the user's instructions to be disobeyed.
Joss Knight
on 27 Apr 2023
Most people using the GPU want every optimization they can get. For instance, strictly speaking a user who has written
C = A.'*B;
has requested that A be transposed, but in fact this never happens.
There is no way the user can see the underlying behaviour. All the instructions are recorded and if the user attempts to access the results of any operation it will be computed.
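Because of this, a timing comparison is only meaningful if the result of each timed operation is actually requested. A minimal sketch, assuming Parallel Computing Toolbox and a supported GPU are available: each handle returns its result to gputimeit, which calls the function with one output by default, so the value has to be produced. The names hasGPU, fNoSum and fSum are illustrative, and no particular timings are implied.

```matlab
% Sketch: time one pass of the operations with the result returned.
% Assumes Parallel Computing Toolbox; the GPU part is skipped otherwise.
hasGPU = exist('gpuDeviceCount') > 0 && gpuDeviceCount > 0;

if hasGPU
    P = 100;
    U = gpuArray(complex(rand(P), rand(P)));
    K = gpuArray(complex(rand(P,P,P), rand(P,P,P)));

    % Each handle returns its result, so the work is requested by the caller.
    fNoSum = @() U.*exp(1i*K);                 % exp and multiply only
    fSum   = @() sum(U.*exp(1i*K),[1,3]);      % exp, multiply and sum

    tNoSum = gputimeit(fNoSum);
    tSum   = gputimeit(fSum);

    fprintf('exp and multiply:      %g s\n', tNoSum);
    fprintf('exp, multiply and sum: %g s\n', tSum);
else
    disp('No supported GPU found; timing sketch skipped.');
end
```

Comparing the two timings gives an estimate of what the summation adds to a single pass, which the loop without the sum cannot show if its intermediate results are never used.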
Damian Suski
on 27 Apr 2023
Edited: Damian Suski
on 27 Apr 2023
Joss Knight
on 27 Apr 2023
Yes. You will get better performance from computing the result in a single sum, but you will probably run out of memory so would have to do it in batches:
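function funTestGPU(P,U,K,UN)
    % Collect the products for all k, then reduce with a single call to sum.
    HUall = zeros(P,P,P,P,'like',U);
    for k = 1:P
        H = exp(1i*K);
        HUall(:,:,:,k) = U.*H;
    end
    UN = sum(HUall,[1,3]);        % size 1-by-P-by-1-by-P
    UN = permute(UN,[4 2 1 3]);   % reorder to P-by-P, with k along the rows
end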
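For P = 200, each P×P×P complex double slice takes 200^3 × 16 bytes = 128 MB, so the full P×P×P×P array would need about 25.6 GB. A batched variant could look like the sketch below. The batch size, the small problem size, and the names batchSize, HUbatch, UNbatched and UNref are illustrative assumptions, not part of the original reply; the code uses CPU arrays so the check runs anywhere, and passing gpuArray inputs should work the same way.

```matlab
% Sketch: accumulate the products in batches and reduce each batch with one sum.
P = 16;
batchSize = 4;                            % assumed; choose to fit device memory
U = complex(rand(P), rand(P));
K = complex(rand(P,P,P), rand(P,P,P));

UNbatched = complex(zeros(P), zeros(P));
for first = 1:batchSize:P
    last = min(first + batchSize - 1, P);
    nb = last - first + 1;                % number of k values in this batch
    HUbatch = zeros(P,P,P,nb,'like',U);
    for k = first:last
        H = exp(1i*K);                    % kept per-k as in the reply
        HUbatch(:,:,:,k-first+1) = U.*H;
    end
    S = sum(HUbatch,[1,3]);               % 1-by-P-by-1-by-nb
    UNbatched(first:last,:) = permute(S,[4 2 1 3]);   % nb-by-P block of rows
end

% Reference: one sum per iteration, as in the original function.
UNref = complex(zeros(P), zeros(P));
for k = 1:P
    UNref(k,:) = sum(U.*exp(1i*K),[1,3]);
end
maxDiff = max(abs(UNbatched(:) - UNref(:)));
```

The batch size trades memory for the number of sum calls: batchSize = 1 reproduces the original loop, and batchSize = P reproduces the single-sum version.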
Damian Suski
on 28 Apr 2023
Damian Suski
on 18 May 2023