Parallelizing MATLAB code using many GPU cores

I have a MATLAB script that runs many independent iterations (a for-loop), of the form

for idx = 1:N
    result(idx) = some_procedure(data(idx));
end
I have an NVIDIA graphics card with over 3000 CUDA cores. Is it possible to parallelize the code so that, e.g., each GPU core handles one iteration? I understand that parfor is not the answer here, but is there an equivalent?

 Accepted Answer

GPU cores do not work like CPU cores. They cannot run independent tasks.
To use the GPU with MATLAB, start with the gpuArray documentation.

10 Comments

Hello,
I'm sorry to correct you, but for GPUs, as for any device that allows SMP (symmetric multiprocessing), independence of the iterations is a necessary condition for parallel computing.
So a FOR loop of independent jobs may very well be parallelized; the only condition is that your environment allows you to do it, for example PARFOR for CPU parallel execution.
Natively, CUDA allows this: the FOR loop will be distributed among the cores of the GPU.
The question was "how to do that inside MATLAB?"... Your answer points the questioner in the opposite direction.
Best regards
This is a fair point, since there is ambiguity in the question. GPU threads can process arrays of data when there are no dependencies between threads, as long as the operations they are performing are the same; unlike CPU cores, which can do entirely different things on different threads.
I answered the question "how to do that inside Matlab" by directing the OP towards the documentation for gpuArray. Generally that's preferable when the question is as general as this one, since I cannot presume to know exactly what bit of the documentation will answer the question; and the asker should familiarize themselves with the background material before asking clarifying questions.
Hope that is satisfactory.
Y.Yang on 19 Apr 2020
Edited: Y.Yang on 19 Apr 2020
Why does gpuArray with a for-loop not significantly increase speed compared to a parfor-loop? I am trying to code a convolutional network in general code, without using the Deep Learning Toolbox, as I have to design some different algorithms to train it. Without the toolbox, it takes me a lot of time to complete one training epoch. I was thinking of using gpuArray instead of a parfor-loop, as I believed it would be much faster. However, when I transfer data to the GPU and run the for-loop, GPU usage shows around 5-10 percent, and the speed improvement is not significant.
Any suggestions on this? Many thanks.
I can't say, because there's no real explanation here of what you're trying to do inside the loop, i.e. what some_procedure is.
The most common mistake for someone new to gpuArray is to assume that MATLAB will parallelize your serial code for you, by putting the body of a for-loop, say, into a kernel that will execute on multiple GPU threads. This is not surprising, because that is indeed the way parfor works on CPU cores. But GPUs do not work like that. You need to write highly vectorized code, using techniques such as those in MATLAB's vectorization documentation. There actually is a way to have MATLAB create a kernel for you, using gpuArray/arrayfun, although this isn't generally necessary.
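To illustrate the gpuArray/arrayfun route: a minimal sketch, assuming the per-element operation uses only element-wise functions (the operation and array size here are made up for illustration). MATLAB compiles the function into a single GPU kernel and launches one thread per element:

```matlab
% Hypothetical element-wise operation; must use only scalar/element-wise
% functions for arrayfun to compile it into a GPU kernel.
myElementOp = @(x) x.^2 + sin(x);

data = gpuArray.rand(1, 3000);        % example data, created directly on the GPU
result = arrayfun(myElementOp, data); % one kernel launch, one thread per element
result = gather(result);              % copy the result back to host memory
```

Note that arrayfun applies the function element-by-element, so this only helps when the loop body is a scalar operation; it will not batch whole-matrix operations.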
If data(idx) is, as it appears to be, a scalar, then it does look as though this is your problem and what you need is to read the GPU documentation, learn about vectorization, and try to rework your algorithm so that it no longer contains loops. However, it could be that you know all this, you have vectorized your code, and all that's happening is that you were expecting some_procedure to run faster with gpuArray inputs. In order to diagnose that, we're going to have to see what you're doing in that function.
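As a sketch of what "vectorizing" the original loop could look like, assuming some_procedure is element-wise (the polynomial below is a made-up stand-in, since the real function was never shown):

```matlab
data = rand(1, 1e6);             % example host data
gdata = gpuArray(data);          % transfer to the GPU once, not per iteration

% Replace "for idx = 1:N, result(idx) = some_procedure(data(idx)); end"
% with one whole-array expression; MATLAB runs it across many GPU threads.
result = 3*gdata.^2 + 2*gdata;

result = gather(result);         % copy back to host memory when done
```

The key point is that the parallelism comes from operating on the whole array at once, not from MATLAB distributing loop iterations.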
Thanks Joss. What I am doing is this: I have hundreds of different 2-D matrices, and I need to perform thousands of matrix multiplications or matrix inversions. The output of each matrix operation (multiplication or inverse) is a matrix as well. Therefore, I have a very big nested loop and was using parfor.
I also tried arrayfun and cellfun, but did not use them successfully. I guess arrayfun only supports element-wise operations, not these matrix multiplication or inversion operations.
I guess that if the operations could be run in arrayfun, the speed could be significantly improved. Right now, compared to the Deep Learning Toolbox functions (activation functions), my current code runs around 50 percent slower.
What you need is pagefun. Give it a try; it supports mtimes, mldivide, and inv, among others.
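A minimal pagefun sketch, assuming the matrices are all the same size and stacked along the third dimension of a gpuArray (the sizes below are made up for illustration):

```matlab
A = gpuArray.rand(8, 8, 500);    % 500 example matrices of size 8-by-8
B = gpuArray.rand(8, 8, 500);

C    = pagefun(@mtimes, A, B);   % batched multiply: C(:,:,k) = A(:,:,k) * B(:,:,k)
Ainv = pagefun(@inv, A);         % batched inverse of every page
X    = pagefun(@mldivide, A, B); % batched solve: X(:,:,k) = A(:,:,k) \ B(:,:,k)
```

Each call runs all 500 small operations as one batched GPU operation, which is where the speedup over a loop of individual mtimes calls comes from.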
I have another problem regarding speed. I have used pagefun, which gives me amazing speed. However, as I have a very large number of matrix multiplications, I still need to speed up my code. I am thinking of using the half or sparse data type, rather than single, to gain speed and save memory. However, pagefun does not support the half or sparse data types. Could you please give me some suggestions? Thanks.
You should look elsewhere for further performance improvements. MATLAB has no half data type, and sparse only supports 2-D matrices, so it cannot be used for batched operations; not that it would help anyway, since sparse only pays off for large, mostly-zero matrices.


More Answers (0)

Asked on 29 Aug 2018
Commented on 25 Apr 2020
