How to run a function on GPU cores

Memo Remo on 24 Jan 2023
Answered: Joss Knight on 26 Jan 2023
I want to run a very simple custom function on the GPU. The function repeatedly reads an image and makes predictions with an existing machine learning model (mdl) via the "predict" command. Can anyone guide me in performing these calculations on each GPU core?
I am not sure my understanding is correct, but with CPU parallel processing we can run such commands on every core at the same time. For instance, if we have 12 images and 12 CPU cores, a "parfor" loop lets us run the function on all 12 images simultaneously. I want to do the same on the GPU. I know that GPUs usually have a lot of cores (thousands), so it would be great if we could have a function like "parfor" for GPU-based parallel processing.
I would appreciate any help you could provide.
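For reference, the CPU pattern described above might look like the following minimal sketch. The synthetic images and the scoreImage function handle are hypothetical stand-ins for real image files and for predict(mdl, ...), so the snippet runs without any data. If Parallel Computing Toolbox is not installed, parfor simply runs as an ordinary serial loop.

```matlab
% Hypothetical stand-ins: 12 synthetic "images" and a dummy scoring function.
numImages = 12;
images = cell(1, numImages);
for k = 1:numImages
    images{k} = rand(64, 64);        % in practice: imread(fileNames{k})
end
scoreImage = @(img) mean(img(:));    % in practice: predict(mdl, features)

scores = zeros(1, numImages);
parfor k = 1:numImages
    % Iterations are independent, so workers can process images concurrently.
    scores(k) = scoreImage(images{k});
end
disp(scores)
```

Each parfor iteration is a complete, independent program run by one CPU worker, which is exactly the property the answers below say GPU cores lack.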

Accepted Answer

Matt J on 24 Jan 2023
so it would be great if we could have a function like "parfor" for GPU-based parallel processing.
No, you cannot use GPU cores in a manner that is analogous to parpool workers. It is a very different type of "core" from what CPUs have.
  3 Comments
Matt J on 24 Jan 2023
Not using Parallel Computing Toolbox GPU techniques. You might be able to GPU-accelerate it, though, using GPU Coder, if you have it.
Memo Remo on 24 Jan 2023
Okay. Thanks for letting me know.
Our university should have it, but I don't know how to work with it. I can read the documentation.
Thanks for your help!


More Answers (2)

Walter Roberson on 24 Jan 2023
Nvidia GPUs are constructed in a hierarchy. Individual GPU cores cannot decode instructions and cannot run programs independently of the other cores in the same group. Cores are grouped together (Nvidia calls such a group a streaming multiprocessor), and each group has a controller. Controllers are somewhat independent of each other. Each controller decodes an instruction and makes the information available on a bus shared by the group of cores. Each core in the group then executes the same instruction, just with different addresses.
Conditional work is not handled by having different cores in a group execute different instructions. Instead, the controller has a way of telling each core whether it should perform the action or just idle. So if/else is handled by having a subset of the cores sit idle during one branch; the controller then changes which cores take part, and the cores that did the work sit idle while the remaining cores execute the other branch.
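The masking idea can be imitated on the CPU with logical indexing. This is only an analogy written in plain MATLAB, not how the hardware is programmed; the variable names and the two branch formulas are made up for illustration.

```matlab
% Emulate if/else over a group of 8 "lanes" that share one instruction stream.
x = [-2 -1 0 1 2 3 4 9];
active = x >= 0;               % per-lane predicate set by the "controller"

y = zeros(size(x));
% Pass 1: the "if" branch runs; lanes with active == false sit idle.
y(active) = sqrt(x(active));
% Pass 2: the mask is flipped and the "else" branch runs on the other lanes.
y(~active) = x(~active).^2;
disp(y)
```

Both passes take time even though each lane uses the result of only one of them, which is why heavily branching code tends to run poorly on a GPU.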
In traditional CPUs, each physical core (not hyperthreaded core) has its own instruction decoder and register set and can execute independently. The number of simultaneous independent programs is based on the number of physical cores.
In Nvidia GPUs, the number of simultaneous independent programs is based not on the number of cores, but rather on the number of controllers. The old generations of Nvidia GPUs started with only two or three controllers. The number has increased a fair bit in more modern models, but you should still expect each controller to be controlling at least 128 cores.
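To see how many such controller groups your own GPU has, a sketch along these lines should work, assuming Parallel Computing Toolbox and a MATLAB release that provides canUseGPU (R2019b or later). The MultiprocessorCount property of the gpuDevice object reports the number of multiprocessors.

```matlab
% Report the number of multiprocessors (controller groups) on the current GPU.
hasGPU = canUseGPU();          % false if no supported GPU or toolbox is available
if hasGPU
    d = gpuDevice;             % the currently selected device
    fprintf('%s has %d multiprocessors\n', d.Name, d.MultiprocessorCount);
else
    disp('No supported GPU is available to MATLAB.');
end
```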
  4 Comments
Walter Roberson on 24 Jan 2023
You need to distinguish between running programs on the GPU, and running functions on the GPU. The Parallel Computing Toolbox provides a gpuArray class for variables intended to "live" on the GPU. At the MATLAB level, the programmer codes operations on the gpuArray variables. MATLAB does not necessarily instruct the GPU to start the operation immediately. Instead, MATLAB can accumulate a chain of operations. For example,
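x = linspace(0, 1, 1000); % example input (assumed here; any real numeric array works)
xG = gpuArray(x);         % logically copy x into the GPU
t1 = sin(xG);             % each step operates on gpuArray data
t2 = t1.^2;
t3 = 1 - t2;
yG = acos(t3);
y = gather(yG);           % bring result back from GPU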
or more briefly,
xG = gpuArray(x); %logically copy x into the GPU
y = gather(acos(1-sin(xG).^2));
Instead of instructing the GPU to run the sin() operation, then the square operation, and so on, MATLAB can internally queue the operations that need to be performed. When the queue gets too complicated, or the buffer fills up, or some internal timeout expires without more operations being queued, MATLAB copies the appropriate pre-compiled GPU kernels to the GPU, which chained together produce the desired outcome (at least part of the way), and then tells the GPU to execute. When the GPU tells MATLAB it has finished that set of operations, MATLAB can copy in more kernels if there is more to do, tell the GPU to execute those, and so on. Eventually either a fatal error occurs or all queued operations complete.
MATLAB then leaves the results sitting on the GPU, expecting that they might be needed again. At some point the user calls gather() to request the results; if the queued operations are not yet complete, MATLAB waits for the GPU to finish, and then the results are copied back from the GPU to MATLAB. When the output variables are cleared or deleted on the MATLAB side, MATLAB knows it can release the results on the GPU.
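One practical consequence is that timing GPU code with tic/toc alone can mislead, because a call may return before the GPU has finished. A minimal sketch, assuming Parallel Computing Toolbox: wait(gpuDevice) blocks until pending GPU work is done. The matrix size is arbitrary, and whether the first timing is noticeably shorter than the second depends on the release and the hardware.

```matlab
n = 1000;
x = rand(n);
if canUseGPU()
    d = gpuDevice;
    xG = gpuArray(x);
    tic;
    yG = xG * xG;          % may return before the GPU has finished
    tQueued = toc;
    wait(d);               % block until all pending GPU work is complete
    tDone = toc;
    y = gather(yG);        % gather would also have waited for completion
    fprintf('returned after %.4f s, finished after %.4f s\n', tQueued, tDone);
else
    y = x * x;             % CPU fallback so the sketch still runs without a GPU
end
```

For benchmarking, gputimeit handles this synchronization for you.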
Notice that in this process, what is executed on the GPU is only pre-compiled GPU kernels -- either ones built into MATLAB or ones the user compiled using GPU Coder.
MATLAB does not routinely perform GPU calculations by generating C++ code for the required steps, compiling that code with nvcc, and executing it on the GPU. The user can use GPU Coder to take those steps explicitly if required.
But what can be compiled to execute on the GPU is limited to calculations (such as mathematical operations), together with "copy in to GPU" and "copy out of GPU" and "notify client that the GPU is finished" kinds of operations.
In theory you could probably memory-map physical I/O ports to control hardware registers on devices such as parallel ports, but memory-mapped I/O requires fairly high kernel-level privileges, and allowing it would be considered a serious security risk for anything other than an embedded system; it is not going to happen in practice. In practice, the GPU is not going to be able to perform any I/O on its own, other than DMA to transfer data and kernels.
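For the original question, this suggests splitting the work: do the file reading on the CPU and send only the numeric part to the GPU. The following is a rough sketch under that assumption; the synthetic image stands in for an imread call, and the normalization is a placeholder for whatever numeric processing is actually needed.

```matlab
% I/O stays on the CPU; only the arithmetic is sent to the GPU.
img = uint8(255 * rand(128, 128));   % stand-in for imread('someImage.png')

useGPU = canUseGPU();
if useGPU
    data = gpuArray(single(img));    % copy pixel data to the GPU
else
    data = single(img);              % CPU fallback
end

% Placeholder computation: zero-mean, unit-variance normalization.
normalized = (data - mean(data(:))) ./ std(data(:));

if useGPU
    result = gather(normalized);     % copy the result back to host memory
else
    result = normalized;
end
```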
Memo Remo on 24 Jan 2023
Awesome! I can understand it now.
Your help is greatly appreciated.



Joss Knight on 26 Jan 2023
It's probably best to read the extensive documentation available online that should give you a complete picture of how to leverage your GPU and what applications it is useful for.
